Match Frequency

samyamkrishna · Post by **samyamkrishna** » Thu Nov 26, 2015 2:44 pm

Hi All,

One of our ETL batches runs for 8 hours.
It has 6 Match jobs in it.

The jobs use Unduplicate matching, and the matches are for Individual, Org, Address, Phone, etc.
Each of them runs for more than an hour.

However, the frequency file is generated from a Row Generator with 1 row that contains all the columns required by the six Match jobs.

My question is:

If we create actual frequency files from the data instead of using a Row Generator, the frequency files will have more detail than the present one.

Will this help improve the performance of the match jobs, since they would use actual data rather than a dummy frequency file?

Regards,
Samyam

ray.wurlod · Post by **ray.wurlod** » Fri Nov 27, 2015 2:47 pm

Define what you mean by "performance" in this context. Certainly generating frequencies will generate more accurate results (for a large enough sample) than an artificially flat frequency distribution.
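To illustrate why a flat frequency distribution gives less accurate results, here is a minimal sketch of a frequency-adjusted agreement weight in the general Fellegi-Sunter style. It is not QualityStage's actual calculation; the function name `agreement_weight`, the `m_prob` value of 0.9, and the sample surnames are all assumptions made for illustration.

```python
import math
from collections import Counter


def agreement_weight(value, freq, total, m_prob=0.9):
    """Illustrative agreement weight: log2(m / u).

    u is approximated by the relative frequency of the value, so
    agreeing on a rare value scores higher than agreeing on a common one.
    m_prob is an assumed probability that true matches agree on the field.
    """
    u_prob = freq[value] / total
    return math.log2(m_prob / u_prob)


# Hypothetical surname column with a skewed distribution (100 rows).
surnames = ["SMITH"] * 60 + ["JONES"] * 39 + ["ZABINSKI"] * 1
actual_freq = Counter(surnames)
actual_total = len(surnames)

# A "flat" distribution, standing in for a dummy frequency file:
# every value is treated as equally common.
flat_freq = Counter({value: 1 for value in actual_freq})
flat_total = len(flat_freq)

for name in ("SMITH", "ZABINSKI"):
    w_actual = agreement_weight(name, actual_freq, actual_total)
    w_flat = agreement_weight(name, flat_freq, flat_total)
    print(f"{name}: actual={w_actual:.2f} flat={w_flat:.2f}")
```

With real frequencies, agreement on the rare surname carries much more weight than agreement on the common one; with the flat distribution both get the same weight. That affects the quality of the scores, not necessarily the run time.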

samyamkrishna · Post by **samyamkrishna** » Fri Nov 27, 2015 2:52 pm

Hi Ray,

I bought the premium membership yesterday, 26th Nov.
But I am still not able to see the content you posted under premium.

I got a mail from Rick stating that I will get another confirmation mail.
How long do you think it will take to get this membership?

Should I also contact editor@, as suggested in one of the recent posts?

Regards,
Samyam

ray.wurlod · Post by **ray.wurlod** » Fri Nov 27, 2015 2:54 pm

Wait till the weekend is over.

It won't do any harm to contact editor@dsxchange.net, but these people actually have a life as well as running DSXchange.

stuartjvnorton · Post by **stuartjvnorton** » Sun Nov 29, 2015 10:46 pm

Define "performance". Number of matches, "quality" of matches (lower false positives / negatives), execution time?

As Ray said, the quality of the scores may improve a little by using accurate frequency data.

As for execution time, the number of records and number of match passes would be the first things to look at for understanding how much time is reasonable to expect it to take, and where you might be able to improve the times.
Also note that it takes time to create match frequency files.

rjdickson · Post by **rjdickson** » Mon Nov 30, 2015 9:00 am

Take a look at your match specification. If you are using overrides for every column, then generating frequencies will not matter as the overrides can take priority.

Is the original question based on curiosity, or are you having quality issues in your matching?

samyamkrishna · Post by **samyamkrishna** » Mon Nov 30, 2015 12:20 pm

Stuart,

I am worried about the execution time.
Thanks for giving those hints on what to look at.

Will look at them to get to a conclusion.

rjdickson,

Yes, there are overrides, but I am not sure if they are for all the columns.
I will check that too.

The question is not based on curiosity. We are having issues with the run times due to a short execution window in production.

rjdickson · Post by **rjdickson** » Mon Nov 30, 2015 12:25 pm

The most frequent cause of bad match performance is (arguably) blocking fields that are too 'loose' (include too many candidate records).

Do you know what pass is causing issues? (Blocking fields are per pass).
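As a rough way to see how "loose" a set of blocking fields is, the sketch below counts the pairwise comparisons implied by each block, assuming every record in a block is compared with every other record in that block. That assumption, the helper name `candidate_pairs`, and the sample records are illustrative only and are not taken from QualityStage.

```python
from collections import Counter


def candidate_pairs(records, blocking_fields):
    """Count pairwise comparisons implied by blocking on the given fields.

    Assumes all records sharing the same blocking key are compared
    with each other, i.e. n * (n - 1) / 2 pairs for a block of size n.
    """
    block_sizes = Counter(
        tuple(record[field] for field in blocking_fields) for record in records
    )
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Hypothetical input records.
records = [
    {"surname": "SMITH", "zip": "10001"},
    {"surname": "SMITH", "zip": "10001"},
    {"surname": "SMITH", "zip": "10002"},
    {"surname": "SMITH", "zip": "10003"},
    {"surname": "JONES", "zip": "10001"},
    {"surname": "JONES", "zip": "10001"},
]

loose = candidate_pairs(records, ["surname"])         # one big SMITH block
tight = candidate_pairs(records, ["surname", "zip"])  # smaller blocks
print(loose, tight)  # prints "7 2"
```

Because the pair count grows roughly with the square of the block size, a few very large blocks in one pass can dominate the run time. Running a similar count per pass against the real data would show which pass to tighten.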

The next thing you can look at is the job design. I would assume there is some sort of read from a database for the reference link. Does that read have a 'where' clause, and if so, are the columns used in the where clause indexed?
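One way to check the index question is to look at the query plan before and after adding an index. The sketch below uses an in-memory SQLite database purely as a stand-in, since the actual database is not stated in this thread. The table `reference_party`, its columns, and the index name are hypothetical, and the syntax for viewing a plan differs between database products.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each plan row holds the human-readable detail.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reference_party ("
    "party_id INTEGER, source_system TEXT, surname TEXT)"
)
conn.executemany(
    "INSERT INTO reference_party VALUES (?, ?, ?)",
    [(i, "CRM" if i % 2 else "ERP", f"NAME{i}") for i in range(1000)],
)

# Hypothetical reference-link read with a where clause.
sql = "SELECT party_id, surname FROM reference_party WHERE source_system = ?"

plan_before = query_plan(conn, sql, ("CRM",))
conn.execute(
    "CREATE INDEX idx_reference_source ON reference_party (source_system)"
)
plan_after = query_plan(conn, sql, ("CRM",))

print("before:", plan_before)  # full table scan
print("after: ", plan_after)   # search using the new index
```

If the plan for the real reference read shows a full scan on a large table, an index on the where-clause columns (or a more selective where clause) is worth discussing with the DBA.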

samyamkrishna · Post by **samyamkrishna** » Fri Dec 04, 2015 3:21 pm

I don't have access to Director on Prod.
I am trying to get access.

I will post my findings once I get hold of the logs.