Match Frequency

Posted: Thu Nov 26, 2015 2:44 pm
by samyamkrishna
Hi All,

One of our ETL batches runs for 8 hours.
This has 6 Match jobs in it.

The jobs use Unduplicate match, and the matches are for Individual, Org, Address, Phone, etc.
Each of them runs for more than an hour.

But the frequency file is generated from a row generator with 1 row and all the columns that are required for all six Match jobs.

My question is:

If we create actual frequency files from the data instead of from a row generator, the frequency files will have more detail than the present frequency file.

Will this help improve the performance of the match jobs, since they would use actual data rather than a dummy frequency file?

Regards,
Samyam

Posted: Fri Nov 27, 2015 2:47 pm
by ray.wurlod
Define what you mean by "performance" in this context. Certainly generating frequencies will generate more accurate results (for a large enough sample) than an artificially flat frequency distribution.
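
To illustrate why a flat frequency distribution loses information, here is a minimal Python sketch of a generic Fellegi-Sunter-style agreement weight, log2(m / u). It is not QualityStage's actual formula: the function name, the fixed m-probability of 0.9, and the sample surnames are all assumptions made for illustration. The only point is that when u is taken from real value frequencies, agreement on a rare value earns a higher weight than agreement on a common one, which a one-row dummy frequency file cannot express.

```python
import math
from collections import Counter

def agreement_weights(values, m_prob=0.9):
    """Hypothetical model: weight = log2(m / u), where u is approximated
    by the relative frequency of each value in the sample."""
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(m_prob / (count / total))
            for value, count in counts.items()}

# Invented sample: one very common surname, one less common, one rare.
surnames = ["SMITH"] * 6 + ["JONES"] * 3 + ["ZABINSKI"]
weights = agreement_weights(surnames)

for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name:10s} {weight:.2f}")
```

With only one generated row, every value would look equally frequent, so no value could be weighted above another on the basis of rarity.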

Posted: Fri Nov 27, 2015 2:52 pm
by samyamkrishna
Hi Ray,

I bought the premium membership yesterday, 26th Nov.
But I am still not able to see the content you posted under premium.

I got a mail from Rick stating that I will get another confirmation mail.
How long do you think it will take for the membership to become active?

Should I also contact editor@, as suggested in one of the recent posts?

Regards,
Samyam

Posted: Fri Nov 27, 2015 2:54 pm
by ray.wurlod
Wait till the weekend is over.

It won't do any harm to contact editor@dsxchange.net, but these people actually have a life as well as running DSXchange.

Posted: Sun Nov 29, 2015 10:46 pm
by stuartjvnorton
Define "performance". Number of matches, "quality" of matches (lower false positives / negatives), execution time?

As Ray said, the quality of score may improve a little by using accurate frequency data.

As for execution time, the number of records and number of match passes would be the first things to look at for understanding how much time is reasonable to expect it to take, and where you might be able to improve the times.
Also note that it takes time to create match frequency files.

Posted: Mon Nov 30, 2015 9:00 am
by rjdickson
Take a look at your match specification. If you are using overrides for every column, then generating frequencies will not matter as the overrides can take priority.

Is the original question based on curiosity, or are you having quality issues in your matching?

Posted: Mon Nov 30, 2015 12:20 pm
by samyamkrishna
Stuart,

I am worried about the execution time.
Thanks for giving those hints on what to look at.

Will look at them to get to a conclusion.


rjdickson,

Yes, there are overrides, but I am not sure if they are for all the columns.
I will check that too.

The question is not based on curiosity. We are having issues with the run times due to a short execution window in production.

Posted: Mon Nov 30, 2015 12:25 pm
by rjdickson
The most frequent cause of bad match performance is (arguably) blocking fields that are too 'loose' (include too many candidate records).

Do you know what pass is causing issues? (Blocking fields are per pass).
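
As a rough illustration of why 'loose' blocking hurts, the sketch below counts the candidate pairs a single pass would have to compare when records are grouped by their blocking fields. This is a simplified model, not how QualityStage is implemented: the function name, field names, and sample records are invented, and it assumes every pair of records within a block is compared.

```python
from collections import Counter

def candidate_pairs(records, blocking_fields):
    """Count the record pairs that fall in the same block, assuming
    every pair within a block is compared once (n * (n - 1) / 2)."""
    block_sizes = Counter(
        tuple(record[field] for field in blocking_fields) for record in records
    )
    return sum(n * (n - 1) // 2 for n in block_sizes.values())

# Invented sample data.
records = [
    {"surname": "SMITH", "postcode": "2000", "birth_year": 1970},
    {"surname": "SMITH", "postcode": "2000", "birth_year": 1970},
    {"surname": "SMITH", "postcode": "2000", "birth_year": 1985},
    {"surname": "SMITH", "postcode": "3000", "birth_year": 1970},
    {"surname": "JONES", "postcode": "2000", "birth_year": 1970},
    {"surname": "JONES", "postcode": "3000", "birth_year": 1991},
]

loose = candidate_pairs(records, ["surname"])
tight = candidate_pairs(records, ["surname", "postcode", "birth_year"])
print(loose, tight)  # prints "7 1"
```

Because the pair count grows with the square of the block size, a few very large blocks can dominate the run time of a pass.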

The next thing you can look at is the job design. I would assume there is some sort of read from a database for the reference link. Does that read have a 'where' clause, and if so, is the column(s) used in the where clause indexed?
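
One way to check whether a 'where' clause can use an index is to look at the query plan before and after the index exists. The sketch below does this with Python's built-in sqlite3 module purely as a stand-in: the table, column, and index names are hypothetical, and the real reference source will have its own database and its own plan tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, surname TEXT, postcode TEXT)"
)
conn.executemany(
    "INSERT INTO customer (surname, postcode) VALUES (?, ?)",
    [("SMITH", "2000"), ("JONES", "3000"), ("NGUYEN", "2000")],
)

query = "SELECT id, surname FROM customer WHERE postcode = ?"

def plan(sql):
    """Return the detail text of SQLite's query plan for the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("2000",)).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = plan(query)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_customer_postcode ON customer (postcode)")
plan_after = plan(query)   # the plan now names the new index

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions, so the thing to look for is whether the plan names the index at all.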

Posted: Fri Dec 04, 2015 3:21 pm
by samyamkrishna
I don't have access to Director on Prod.
I am trying to get access.

Will post my findings once i get hold of the logs.