DSXchange

Posted: **Thu Nov 08, 2007 11:02 pm**

Hi all,

I am using QualityStage (8.0, Hawk) to format address data. I ran the data through the various stages (Investigate, Standardize, Match Frequency, Unduplicate and Survive).

The data is about 500000 records. Investigate and Standardize took 40 minutes each, Match Frequency took about 45 minutes, and Unduplicate took about the same. But the Survive job ran for more than four hours on only 50000 records, and I killed it. Can anyone suggest how to improve the performance and reduce the run time?

Nature of data (investigation)

rec1 rec2 rec3 rec4
jnm gg 1 number, street
jnm gg 2 city, state.

Standardization (USPREP, USNAME, USADDR, USAREA)
Match frequency
Unduplicate match (created a match specification and a pass); selected match and duplicate, with m probability .9 and u probability .1.
Standardize: qsMatchType="MP" (matched pattern) selected.

Thanks for the suggestions.
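
For reference, the m and u probabilities quoted above are the inputs to the usual probabilistic-matching (Fellegi-Sunter) field weights. The sketch below assumes the standard log2 formulation and only shows what .9 and .1 work out to for a single field; the function names are made up for illustration, and the actual weights depend on the match specification.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Weight contributed when a field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Weight contributed when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1.0 - m) / (1.0 - u))

# m and u as quoted in the post; everything else here is illustrative.
m, u = 0.9, 0.1
print(round(agreement_weight(m, u), 2))     # 3.17
print(round(disagreement_weight(m, u), 2))  # -3.17
```

With these settings each agreeing field adds about 3.17 to the composite weight and each disagreeing field subtracts the same amount, so loose cutoffs can pull a lot of records into the same match group.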

Posted: **Fri Nov 09, 2007 4:56 am**

First, check the size of your match groups in the match output report in the data folder of the project. This information is held at the bottom of the report. Group sizes should be less than 100, although it depends on the data, machine size, etc. What you are looking for is a very large maximum group size.
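
If the report cannot be found, the same check can be approximated from the matched output file itself. This is a minimal sketch in plain Python, not a QualityStage feature: it assumes the matched records were written as delimited text with a header row and that the match-set identifier column is called `qsMatchSetID` (adjust to whatever your output actually contains). The sample rows are invented purely to make the snippet runnable.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for the matched-records file (seq file 5).
SAMPLE = """qsMatchSetID,rec1
1,jnm
1,jnm
1,
2,abc
3,xyz
3,xyz
"""

def match_group_sizes(lines, key="qsMatchSetID"):
    """Count how many records share each match-set id (column name is an assumption)."""
    return Counter(row[key] for row in csv.DictReader(lines))

def summarize(sizes, limit=100):
    """Report the number of groups, the largest group, and how many exceed the limit."""
    largest = max(sizes.values(), default=0)
    oversized = sum(1 for n in sizes.values() if n > limit)
    return {"groups": len(sizes), "largest": largest, "oversized": oversized}

sizes = match_group_sizes(io.StringIO(SAMPLE))
print(summarize(sizes, limit=2))  # {'groups': 3, 'largest': 3, 'oversized': 1}
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample and leave `limit` at 100.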

Then, what are your survival rules? Are they simple, as in only survive the "XA" master record, or have you got something more complicated?

A combination of large match groups and complex survival rules may be the problem.

If your rules are complex then, as a test, try making them simple and see if that helps.

Also a trawl through the logs in the logs folder can sometimes reveal what is happening.

Hope this helps.

Posted: **Fri Nov 09, 2007 8:41 am**

Thanks Boxtoby for the response...

As I mentioned in my previous post, I am using QS version 8 (Hawk). I am not sure whether it creates a report file in the (project/data) folder as version 7.5 did.

My survival rules are simple. I am just bringing across the master records, with no transformations or complex rules.

For example:

rec1 most frequent(nonblank) (as a rule) rec1.

Thanks for the suggestion.

Posted: **Fri Nov 09, 2007 11:24 am**

Hmm!

If you can, I would try to find the logs and match report file, if only because support will probably ask for them if you go down the help desk route, which is looking likely!

An alternative approach might be to actually implement the survivorship in DS. It's not too difficult and is possibly easier to support in the longer term.
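
To make the idea concrete, here is a rough sketch of the "most frequent (non-blank)" rule as a single pass over grouped records. It is plain Python rather than a DataStage job, the group key name `qsMatchSetID` is an assumption, and the records are invented; in DS the equivalent would be built from the usual sort and aggregation stages.

```python
from collections import Counter, defaultdict

def most_frequent_nonblank(values):
    """Return the most common non-blank value, or "" if every value is blank."""
    counts = Counter(v.strip() for v in values if v and v.strip())
    return counts.most_common(1)[0][0] if counts else ""

def survive(records, group_key, columns):
    """Collapse each match group to one surviving record, reading the input once."""
    groups = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for col in columns:
            groups[rec[group_key]][col].append(rec.get(col, ""))
    return {
        gid: {col: most_frequent_nonblank(vals) for col, vals in cols.items()}
        for gid, cols in groups.items()
    }

# Invented records; only the shape matters.
records = [
    {"qsMatchSetID": "1", "rec1": "jnm"},
    {"qsMatchSetID": "1", "rec1": ""},
    {"qsMatchSetID": "1", "rec1": "jnm"},
    {"qsMatchSetID": "1", "rec1": "jmn"},
    {"qsMatchSetID": "2", "rec1": "gg"},
]
print(survive(records, "qsMatchSetID", ["rec1"]))
# {'1': {'rec1': 'jnm'}, '2': {'rec1': 'gg'}}
```

The work here grows with the number of records, not with the square of the group size, which is why a hand-built version can be a useful comparison when the Survive job runs for hours.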

Posted: **Fri Nov 09, 2007 3:54 pm**

I will definitely look for the logs and report files if they are in my project directory.

OK, let me lay out the jobs that I created. Please let me know if there are any changes I need to make.

Job 1: seq file 1 ---> Investigate ---> seq file 2

Job 2: seq file 1 ---> Standardize ---> seq file 3

Job 3: seq file 3 ---> Match Frequency ---> seq file 4

Job 4: seq file 3, seq file 4 ---> Unduplicate Match ---> matched records (seq file 5), unmatched records (seq file 6)

Job 5: seq file 5 (matched records) ---> Survive ---> seq file 7

Thanks once again for your suggestion.

Posted: **Fri Nov 09, 2007 7:41 pm**

Sorry, this is the correct order from the third job onward.

Job 3: seq file 3 ---> Match Frequency ---> dataset

Job 4: seq file 3, dataset ---> Unduplicate Match ---> matched records (seq file 5), unmatched records (seq file 6)

Job 5: seq file 5 (matched records) ---> Survive ---> seq file 7

Quality Stage Performance Issue
