
Reference Match Input Considerations

Posted: Mon Mar 04, 2013 10:51 am
by ymadden@bmi.com
We're putting together our first QualityStage match job, using product-type data. For proof-of-concept purposes, we've been using datasets as the standardization/frequency output and as the reference match input. During the match job, it appears the match stage waits for the entire reference file to be read into memory before doing any matching. Is that really the case? If we were using a database as the reference match input, would it attempt to read the entire reference table into memory before blocking and matching?

Posted: Mon Mar 04, 2013 7:56 pm
by stuartjvnorton
When you created the match frequency job, did you specify your match spec in the match frequency stage?
If you did, it only gets match frequencies for the fields you are matching on.
If you didn't, it gets match frequency data on all fields (including the primary key) so your match frequency data will be largely overhead.

Posted: Tue Mar 05, 2013 9:11 am
by ymadden@bmi.com
Yeah. The RefFreq input is pretty small (under 200 rows). But our entire reference input will eventually be on the order of several hundred million rows.

When the match runs with a database connector as the reference input:
Does it do a "select * from reference_table" into a temp dataset or memory or something, and then block and match against that?
Or does it do a "select * from reference_table where block_conditions_are_met" into memory/dataset and match against that?

Posted: Tue Mar 05, 2013 10:15 am
by boxtoby
When performing reference matches, it is a good idea to reduce the amount of data in the reference set if possible. I don't know your data, but if you were using address data, for example, there would be no point in loading reference addresses for London when the source only contains addresses from Edinburgh, Swindon and Bristol.

Hope that makes sense.

Cheers,
Bob.
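QualityStage doesn't expose this step as code, but the idea above can be sketched in Python: derive the filter for the reference pull from the block values that actually occur in the source. The table and column names here are made up for illustration.

```python
# Build a WHERE clause for the reference pull from the block values
# present in the source data, so e.g. London reference rows are never
# read when the source only has Edinburgh, Swindon and Bristol.
# "reference_table" and "city" are illustrative names only.

def reference_filter(source_rows, block_column):
    """Return (sql, params) selecting only reference rows whose
    block value appears somewhere in the source."""
    values = sorted({row[block_column] for row in source_rows})
    placeholders = ", ".join("?" for _ in values)
    sql = (f"SELECT * FROM reference_table "
           f"WHERE {block_column} IN ({placeholders})")
    return sql, values

source = [{"city": "Edinburgh"}, {"city": "Bristol"},
          {"city": "Edinburgh"}, {"city": "Swindon"}]
sql, params = reference_filter(source, "city")
# params is ['Bristol', 'Edinburgh', 'Swindon']; London is never pulled.
```

In practice you would feed a filter like this into the user-defined SQL of the reference connector, rather than letting it select the whole table.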

Posted: Tue Mar 05, 2013 4:44 pm
by stuartjvnorton
ymadden@bmi.com wrote:Yeah. The RefFreq input is pretty small (under 200 rows). But our entire reference input will eventually be on the order of several hundred million rows.

When the match runs with a database connector as the reference input:
Does it do a "select * from reference_table" into a temp dataset or memory or something, and then block and match against that?
Or does it do a "select * from reference_table where block_conditions_are_met" into memory/dataset and match against that?
OK, now I get what you're after.
By default, it will load your whole reference set. Why wouldn't it? You're the one telling the job what to do, after all. It also needs to do it in advance: it doesn't want to miss potential matches because the reference set is incomplete.

If you want to limit the size of the reference set, you would need to filter the reference set yourself. IIRC, this is pretty common for SOA reference match jobs, so you should be able to pick up some tips from the ISD guides.

Posted: Tue Mar 05, 2013 4:56 pm
by stuartjvnorton
Also, if you are using high-cardinality columns in the match spec, such as a social security number or a location ID, go to Configure Specification -> Variable Special Handling in the Match Designer and apply the NOFREQ action to those fields.
That will exclude them from match frequency generation and greatly reduce the size of the match frequency file.
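A rough sketch of why this helps, not QualityStage's actual internals: frequency generation counts distinct values per column, so a near-unique column like an SSN contributes one frequency row per record, all with count 1, which is pure overhead for match weighting.

```python
from collections import Counter

def frequency_rows(records, columns, nofreq=()):
    """Count value frequencies per column, skipping NOFREQ columns.
    Illustrative analogue of match frequency generation."""
    freqs = {}
    for col in columns:
        if col in nofreq:
            continue
        freqs[col] = Counter(r[col] for r in records)
    return freqs

records = [{"ssn": str(i), "surname": "SMITH" if i % 2 else "JONES"}
           for i in range(1000)]

all_freqs = frequency_rows(records, ["ssn", "surname"])
trimmed = frequency_rows(records, ["ssn", "surname"], nofreq={"ssn"})
# all_freqs["ssn"] holds 1000 distinct values with count 1 each;
# surname holds just 2. Dropping ssn removes almost all the volume.
```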

Posted: Thu Mar 07, 2013 9:20 am
by ymadden@bmi.com
Thanks guys, that all makes sense.

As a follow-up: we're generating several match-key columns in a standardization job that are used solely for blocking and matching (NYSIIS, acronyms, tokens, etc.). Should these match keys be persisted in the database alongside the data, or should they be generated in flight as the reference data is read from the database?

Posted: Thu Mar 07, 2013 1:35 pm
by ray.wurlod
Persist them in the database table, and index them.
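A minimal sketch of the idea with sqlite3. The phonetic key here is a crude stand-in (first letter plus remaining consonants), not real NYSIIS, and the table name is invented; the point is that a key persisted and indexed once lets a blocked reference pull do an index seek instead of recomputing keys and scanning.

```python
import sqlite3

def crude_key(name):
    """Stand-in for a real phonetic key (NYSIIS etc.): first letter
    plus the consonants of the rest. Illustrative only."""
    name = name.upper()
    return name[:1] + "".join(c for c in name[1:] if c not in "AEIOU")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reference (id INTEGER, surname TEXT, match_key TEXT)")
rows = [(i, s, crude_key(s)) for i, s in
        enumerate(["Madden", "Maddan", "Norton", "Wurlod"])]
conn.executemany("INSERT INTO reference VALUES (?, ?, ?)", rows)

# Index the persisted key so a blocked reference pull is a cheap seek,
# not a full-table scan plus per-row key computation.
conn.execute("CREATE INDEX ix_reference_match_key ON reference (match_key)")

hits = conn.execute(
    "SELECT surname FROM reference WHERE match_key = ?",
    (crude_key("Maddon"),)).fetchall()
# Both spelling variants of Madden come back; Norton and Wurlod don't.
```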

Posted: Thu Mar 07, 2013 3:21 pm
by ymadden@bmi.com
But what's the point of indexing them if QS does not block and match against the database?

Posted: Thu Mar 07, 2013 11:56 pm
by ray.wurlod
A reference match can be (and often is) against the database.

Posted: Fri Mar 08, 2013 8:38 am
by ymadden@bmi.com
Our plan is to run the match against the database, so this brings me back to my original question: assume I run a match job with a DB connector as the reference input. Does the match job issue a single select statement at the beginning of the run to pull the entire reference set into scope for blocking and matching? Or does it issue a separate select statement (with the blocking conditions in a where clause) for each record in the data set?

Posted: Sat Mar 09, 2013 2:27 am
by stuartjvnorton
And if you read my response, you'd know that it reads it all, unless you filter it yourself.
Like I also said, look at the guides for Information Services Director and real-time jobs, and you will see how they do exactly what you're asking for.
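The two behaviours being contrasted in this thread can be sketched side by side. This is an illustration in Python with an invented table, not QualityStage's implementation: the default batch path pulls everything once, while the real-time/ISD pattern issues one parameterised, blocked select per incoming record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reference (block_key TEXT, name TEXT)")
conn.executemany("INSERT INTO reference VALUES (?, ?)",
                 [("MDDN", "Madden"), ("MDDN", "Maddan"),
                  ("NRTN", "Norton"), ("WRLD", "Wurlod")])

# Strategy 1: default batch behaviour -- one SELECT, whole reference set
# loaded up front before any blocking or matching happens.
full_set = conn.execute("SELECT * FROM reference").fetchall()

# Strategy 2: the real-time / ISD pattern -- a parameterised SELECT per
# incoming record, with the blocking condition in the WHERE clause, so
# only candidate rows ever leave the database.
def candidates(block_key):
    return conn.execute(
        "SELECT name FROM reference WHERE block_key = ?",
        (block_key,)).fetchall()

subset = candidates("MDDN")  # 2 candidate rows instead of all 4
```

With hundreds of millions of reference rows, strategy 1 front-loads the cost once per run; strategy 2 trades that for one small indexed query per source record.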

Posted: Sat Mar 09, 2013 12:09 pm
by ymadden@bmi.com
Point taken. Thanks. We still have a lot to learn about this platform.