Match specification

Infosphere's Quality Product

kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Match specification

Post by kennyapril »

I have designed a job that uses a Reference Match stage.
The stage uses a match specification with three passes.

How can I test the passes to find the match hit rate?
Do I need to run the job, or is there another way to run the passes and find the hit rate?

Can anyone help me with this?
Regards,
Kenny
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The match specification designer allows you to place passes into a "holding area" so that particular combinations of match passes may be tested. Testing is done within that designer; you don't need formally to run the job. And you can specify a sample of data if you wish.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Yes, I can test the passes now.

There are 3 passes in the match specification.

I can see the pass statistics, but how do I tell whether the hit rate is good? At the moment the pass statistics show 85% residuals, 10% matched and 5% clerical.

If the hit rate were higher, would the match % be higher?
Please clarify.
Regards,
Kenny
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Depends what you mean by "hit". I assume you mean "match". In that case, the match percentage is a measure of hit rate. But some of the clericals may end up being true matches once reviewed (you won't know till that's been done). "Match" in this context is the percentage of records that exist in a set with a master and at least one duplicate.
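As a rough illustration of that point, here is a minimal Python sketch that turns pass counts into percentages and brackets the match rate between "matched only" and "matched plus every clerical confirmed on review". The function name and the counts are hypothetical, chosen only to reproduce the 85% / 10% / 5% split quoted earlier; nothing here comes from the product itself.

```python
def pass_rates(matched, clerical, residual):
    """Return each category as a percentage of all records in the pass."""
    total = matched + clerical + residual
    if total == 0:
        raise ValueError("no records to summarise")
    return {
        "matched": 100.0 * matched / total,
        "clerical": 100.0 * clerical / total,
        "residual": 100.0 * residual / total,
    }


# Hypothetical counts that give the 10% / 5% / 85% split from the thread.
rates = pass_rates(matched=100, clerical=50, residual=850)

# Lower bound: only the automatic matches count as hits.
lower_bound = rates["matched"]
# Upper bound: every clerical pair turns out to be a true match on review.
upper_bound = rates["matched"] + rates["clerical"]

print(f"match rate is between {lower_bound}% and {upper_bound}%")
```

The true figure sits somewhere in that range and is only known after the clerical pairs have been reviewed.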
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Yes, by hit rate I mean match rate.

In all 3 passes the residuals have the highest percentage, then the clericals, then the matches.

So to increase the match rate, the only things I can do are choose different blocking columns and matching commands, and adjust the match cutoff.

Is that conclusion right?

Also, are the pass statistics the only place to check the match rate, or is there another way?
Regards,
Kenny
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

The source has 3 million records and the reference has 0.2 million records.

The source and reference, along with their frequency data, are matched in a Reference Match stage with a match specification that has 3 passes.

The outputs I use from the Reference Match stage are matched and clerical.

After I run the job I see 89,600 matched records and 14,200 clerical records, but the pass statistics show different figures:
Statistic: Pass 1, Pass 2, Pass 3
Blocks processed: 384, 57373, 25966
Matched pairs: 569, 2631, 18
Exact matched pairs: 430, 0, 0
Clerical: 0, 59468, 9
Reference duplicates: 592, 44951, 11
Exact reference duplicates: 407, 0, 0
Data residuals: 207267, 124382, 204633
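To put numbers on the gap, the per-pass figures can be totalled and set against the job's output counts. This is only an arithmetic sketch in Python: the variable names are made up, and it assumes that the job's "matched records" and the statistics' "matched pairs" are meant to be comparable, which the thread does not establish.

```python
# Per-pass figures copied from the pass statistics (Pass 1, Pass 2, Pass 3).
pass_stats = {
    "matched": [569, 2631, 18],
    "clerical": [0, 59468, 9],
}

# Record counts reported on the job's output links.
job_output = {"matched": 89_600, "clerical": 14_200}


def compare(pass_stats, job_output):
    """Return (statistics total, job count, job minus statistics) per category."""
    rows = {}
    for name, per_pass in pass_stats.items():
        total = sum(per_pass)
        rows[name] = (total, job_output[name], job_output[name] - total)
    return rows


summary = compare(pass_stats, job_output)
for name, (stat_total, job_count, diff) in summary.items():
    print(f"{name}: statistics={stat_total} job={job_count} difference={diff}")
```

The totals differ in both directions (the job shows more matched records but fewer clericals than the statistics), so the two sets of figures are evidently not counting the same thing or the same data.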


Why do the record counts differ between the job and the statistics?

Please suggest any changes that would improve the match rate or count.
Regards,
Kenny
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

kennyapril wrote: Please suggest if any changes are required for improving the match rate or count?
The match rate is primarily driven by your data. It may be impossible to "improve" it. If the match specifications are correct, the matches in the data (and only the matches in the data) will be detected.
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Earlier there were only 20,000 matched records, but after I changed the blocking columns and the matching commands the count rose to 89,000.

But this count still does not equal the matched records shown in the passes.

Why does this happen?
Regards,
Kenny
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No idea without a lot more information about your choices of blocking fields and match rules!
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

The blocking columns in the three passes are:
1. NYSIISfirstname, last_name, zip
2. License_num, License_state, gender
3. last_name, gender, first_name

The matching commands use the other columns:
firstname, middlename, street_no, first_init, gender, last_name

This is a reference match with the one-to-many multiple match type.

Are any changes needed to the blocking columns or matching commands?
Regards,
Kenny
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Try making your blocking a little looser.
Remember, blocking is exact, so anything that might have scored ok through inexact matches won't get in because blocking kept it out.
Experiment a bit with your blocking and matching fields. Some choices will be better than others, but give it a go: probabilistic matching is not a cookie cutter exercise.
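To see why exact blocking can keep a good candidate out, here is a small Python sketch. It is not how the Match stage is implemented; the function, the toy records and the column choices are all hypothetical, loosely modelled on the third pass's blocking columns.

```python
from collections import defaultdict


def candidate_pairs(source, reference, block_cols):
    """Pair source and reference records whose blocking columns agree exactly."""
    blocks = defaultdict(list)
    for ref in reference:
        blocks[tuple(ref[c] for c in block_cols)].append(ref)

    pairs = []
    for src in source:
        key = tuple(src[c] for c in block_cols)
        # Only records in the same block are ever compared and scored.
        for ref in blocks.get(key, []):
            pairs.append((src["id"], ref["id"]))
    return pairs


# Hypothetical toy data: s1 and r1 differ only in the spelling of the first name.
source = [
    {"id": "s1", "last_name": "SMITH", "first_name": "JON", "gender": "M"},
    {"id": "s2", "last_name": "JONES", "first_name": "ANN", "gender": "F"},
]
reference = [
    {"id": "r1", "last_name": "SMITH", "first_name": "JOHN", "gender": "M"},
    {"id": "r2", "last_name": "JONES", "first_name": "ANN", "gender": "F"},
]

# Tight blocking: first_name must agree exactly, so JON/JOHN never meet.
tight = candidate_pairs(source, reference, ["last_name", "gender", "first_name"])
# Looser blocking: first_name is left to the matching commands to score.
loose = candidate_pairs(source, reference, ["last_name", "gender"])

print(len(tight), len(loose))
```

With the tighter blocking the JON/JOHN pair is never scored at all; with the looser blocking it becomes a candidate and the matching commands get a chance to weigh the near-miss. The trade-off is larger blocks and more comparisons.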

And I'm sure you did this, but I have to ask: did you standardise the data first? It can make a huge difference.
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

So, just play around with the blocking and matching criteria.


Yes, I standardized the data:
Using USNAME ----> firstname, lastname
Using USAREA ----> city, state, zip
Using USADDR ----> addr1, addr2, addr3

After standardizing, I used the Match Frequency stage to generate frequency data for the source and the reference.

That produced 3,500 source frequency records and 4,000 reference frequency records.

I then designed a job with a Reference Match stage and a match specification. When I run that job, the source and reference frequency record counts are doubled,
i.e. 7,000 and 8,000 arrive as input to the Reference Match stage.

Why does this happen?

Is this normal?

Thanks,
Regards,
Kenny