Hi All,
One of our ETL batches runs for 8 hours.
It contains 6 Match jobs.
The jobs use Unduplicate Match, and the matches are for Individual, Org, Address, Phone, etc.
Each of them runs for more than an hour.
However, the frequency file is generated from a Row Generator with 1 row that contains all the columns required by all six Match jobs.
My question is this:
if we create actual frequency files from the data instead of from a Row Generator,
the frequency files will have more detail than the present frequency file.
Will this help improve the performance of the Match jobs, because they would use actual data rather than a dummy frequency file?
Regards,
Samyam
Match Frequency
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Define what you mean by "performance" in this context. Certainly generating frequencies will generate more accurate results (for a large enough sample) than an artificially flat frequency distribution.
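To make the "flat versus observed" point concrete, here is a minimal sketch of a generic Fellegi-Sunter style agreement weight, log2(m/u), where u is taken as the relative frequency of the value. This is an illustration of the general idea only, not QualityStage's internal computation; the m probability of 0.9, the surnames and their counts are all made up.

```python
import math
from collections import Counter


def agreement_weight(m_prob, u_prob):
    """Generic Fellegi-Sunter style agreement weight: log2(m / u)."""
    return math.log2(m_prob / u_prob)


def frequency_weights(values, m_prob=0.9):
    """Per-value agreement weights, using each value's relative frequency as u."""
    counts = Counter(values)
    total = len(values)
    return {v: agreement_weight(m_prob, c / total) for v, c in counts.items()}


# Hypothetical sample: one very common surname and two rarer ones.
surnames = ["SMITH"] * 80 + ["NGUYEN"] * 15 + ["ZABRISKIE"] * 5

observed = frequency_weights(surnames)
# Flat assumption: three distinct values, all treated as equally likely.
flat = agreement_weight(0.9, 1 / 3)

for name, weight in sorted(observed.items()):
    print(f"{name:10s} observed={weight:5.2f} flat={flat:5.2f}")
```

With observed frequencies, agreement on a rare value scores higher than agreement on a common one; with a flat distribution every value gets the same weight. That affects the scores, not necessarily the run time.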
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
Hi Ray,
I bought the premium membership yesterday, 26th Nov.
But I am still not able to see the content you posted under premium.
I got a mail from Rick stating that I will get another mail of confirmation.
How long do you think it will take for the membership to become active?
Should I also contact editor@, like in one of the recent posts?
Regards,
Samyam
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Wait till the weekend is over.
It won't do any harm to contact editor@dsxchange.net, but these people actually have a life as well as running DSXchange.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
Define "performance". Number of matches, "quality" of matches (lower false positives / negatives), execution time?
As Ray said, the quality of the scores may improve a little by using accurate frequency data.
As for execution time, the number of records and the number of match passes are the first things to look at, both to judge how much time is reasonable to expect and to see where you might be able to improve it.
Also note that it takes time to create match frequency files.
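As a rough back-of-envelope for why record counts and passes matter, the sketch below counts pairwise comparisons under the simplifying assumption that each pass splits the records into equally sized blocks and compares every pair within a block. The function names and the numbers are hypothetical, and real block sizes are rarely this even.

```python
def pairs_in_block(n):
    """Number of distinct record pairs in a block of n records."""
    return n * (n - 1) // 2


def estimate_comparisons(num_records, blocks_per_pass):
    """Total pairwise comparisons over all passes.

    blocks_per_pass holds one entry per pass: the number of blocks that
    pass creates. Records are assumed to spread evenly across the blocks.
    """
    total = 0
    for num_blocks in blocks_per_pass:
        block_size, remainder = divmod(num_records, num_blocks)
        # 'remainder' blocks hold one extra record each.
        total += remainder * pairs_in_block(block_size + 1)
        total += (num_blocks - remainder) * pairs_in_block(block_size)
    return total


# Hypothetical job: two passes, the first with coarse blocks, the second finer.
print(estimate_comparisons(1000, [10, 100]))
# Doubling the records with the same blocks roughly quadruples the work.
print(estimate_comparisons(2000, [10, 100]))
```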
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
Stuart,
I am worried about the execution time.
Thanks for giving those hints on what to look at.
Will look at them to get to a conclusion.
rjdickson,
Yes, there are overrides, but I am not sure whether they cover all the columns.
I will check that too.
The question is not based on curiosity. We are having issues with run times because of a short execution window in production.
Cheers,
Samyam
The most frequent cause of bad match performance is (arguably) blocking fields that are too 'loose' (include too many candidate records).
Do you know which pass is causing issues? (Blocking fields are defined per pass.)
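One way to check whether a pass is too loose is to profile the block sizes its key would produce on a sample of the data. The sketch below is only an illustration done outside the tool: the records, the field names and the two candidate keys are made up.

```python
from collections import Counter


def block_report(records, key, top=3):
    """Return (total candidate pairs, largest blocks) for a blocking key.

    key is a function that maps a record to its blocking key tuple.
    """
    sizes = Counter(key(rec) for rec in records)
    pairs = sum(n * (n - 1) // 2 for n in sizes.values())
    return pairs, sizes.most_common(top)


# Hypothetical sample records.
records = [
    {"postcode": "M5V", "surname": "SMITH"},
    {"postcode": "M5V", "surname": "SMYTH"},
    {"postcode": "M5V", "surname": "JONES"},
    {"postcode": "M5V", "surname": "JONAS"},
    {"postcode": "K1A", "surname": "SMITH"},
    {"postcode": "K1A", "surname": "BROWN"},
]

# Loose key: postcode only. Tighter key: postcode plus surname initial.
loose_pairs, loose_top = block_report(records, lambda r: (r["postcode"],))
tight_pairs, tight_top = block_report(
    records, lambda r: (r["postcode"], r["surname"][0])
)

print("loose:", loose_pairs, loose_top)
print("tight:", tight_pairs, tight_top)
```

A few very large blocks usually dominate the pair count. Tightening a key cuts the comparisons but can also separate true matches, which is one reason to use several passes with different keys.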
The next thing you can look at is the job design. I would assume there is some sort of read from a database for the reference link. Does that read have a 'where' clause, and if so, are the column(s) used in the where clause indexed?
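To illustrate the index question, here is a small sketch that uses SQLite purely as a stand-in for whatever database the job really reads from. The table, columns and index name are hypothetical, and the equivalent check on your own database would use its own explain-plan facility.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, postcode TEXT, surname TEXT)"
)
conn.executemany(
    "INSERT INTO customer (postcode, surname) VALUES (?, ?)",
    [("M5V", "SMITH"), ("K1A", "JONES"), ("M5V", "BROWN")],
)

sql = "SELECT id, surname FROM customer WHERE postcode = ?"

# Without an index on the filtered column the plan is a full table scan.
plan_before = query_plan(conn, sql, ("M5V",))

conn.execute("CREATE INDEX idx_customer_postcode ON customer (postcode)")

# With the index in place the plan searches through the index instead.
plan_after = query_plan(conn, sql, ("M5V",))

print(plan_before)
print(plan_after)
```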
Regards,
Robert
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto