Match specification

kennyapril · Post by **kennyapril** » Tue Feb 08, 2011 8:59 am

I have designed a job that uses a Reference Match stage.
The stage uses a match specification with 3 passes.

How can I test the passes to find the match hit rate?
Do I need to run the job, or is there another way to run the passes and find the hit rate?

Can anyone help me with this?

ray.wurlod · Post by **ray.wurlod** » Tue Feb 08, 2011 12:07 pm

The match specification designer allows you to place passes into a "holding area" so that particular combinations of match passes may be tested. Testing is done within that designer; you don't need formally to run the job. And you can specify a sample of data if you wish.

kennyapril · Post by **kennyapril** » Tue Feb 08, 2011 7:33 pm

Yes, I can test the passes now.

There are 3 passes in the match specification.

I can see the pass statistics, but how do I know whether the hit rate is high? Right now the pass statistics show 85% residuals, 10% matched and 5% clerical.

If the hit rate were higher, would the match % be higher too?
Please clarify.

ray.wurlod · Post by **ray.wurlod** » Tue Feb 08, 2011 10:32 pm

Depends what you mean by "hit". I assume you mean "match". In that case, the match percentage is a measure of hit rate. But some of the clericals may end up being true matches once reviewed (you won't know till that's been done). "Match" in this context is the percentage of records that exist in a set with a master and at least one duplicate.
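
As a rough illustration of that definition, the arithmetic can be sketched in Python. The helper name `match_rates` and the record counts are hypothetical; only the 85/10/5 split comes from the pass statistics quoted earlier in the thread. The sketch reports the match rate as counted, plus the ceiling it could reach if every clerical record were confirmed as a match on review.

```python
def match_rates(matched: int, clerical: int, residual: int) -> dict:
    """Return the match rate and its upper bound as percentages.

    Hypothetical helper: the counts are per-record totals read off the
    pass statistics, not values produced by QualityStage itself.
    """
    total = matched + clerical + residual
    if total == 0:
        raise ValueError("no records to summarise")
    return {
        # Share of records already classified as matches.
        "match_pct": 100.0 * matched / total,
        # Ceiling if clerical review confirms every clerical record.
        "max_match_pct": 100.0 * (matched + clerical) / total,
        "residual_pct": 100.0 * residual / total,
    }


# Illustrative counts per 100 records, in the 10/5/85 proportions from the thread.
rates = match_rates(matched=10, clerical=5, residual=85)
print(rates)
```

The true match rate lies somewhere between `match_pct` and `max_match_pct`, and cannot be pinned down until the clerical records have been reviewed.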

kennyapril · Post by **kennyapril** » Wed Feb 09, 2011 8:52 am

Yes, by hit rate I mean match rate.

In the 3 passes, residuals have the highest percentage, followed by clericals and then matches.

So to increase the match rate, the only thing to do is to choose the blocking columns and the matching commands, together with a match cutoff.

Is that conclusion right?

Also, to check the match rate do I need to look at the pass statistics, or is there another way to check it?

kennyapril · Post by **kennyapril** » Sat Feb 12, 2011 2:59 pm

The source has 3 million records and the reference has 0.2 million records.

The source and reference, along with their frequencies, have been matched using a reference match with a match specification that has 3 passes.

The outputs I used from the reference match are matched and clerical.

After running the job I see 89,600 matched records and 14,200 clerical records, but the pass statistics show different figures:
| Statistic | Pass 1 | Pass 2 | Pass 3 |
|---|---|---|---|
| Blocks processed | 384 | 57373 | 25966 |
| Matched pairs | 569 | 2631 | 18 |
| Exact matched pairs | 430 | 0 | 0 |
| Clerical | 0 | 59468 | 9 |
| Reference duplicates | 592 | 44951 | 11 |
| Exact reference duplicates | 407 | 0 | 0 |
| Data residuals | 207267 | 124382 | 204633 |

Why are the record counts different between the job and the statistics?

Please suggest any changes that would improve the match rate or count.
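
To make the gap concrete, the per-pass figures quoted above can be totalled and set beside the job's output counts. This is only a sketch in Python: the dictionary keys are hypothetical labels, the numbers are copied from this post, and the totals do not by themselves explain why the job and the pass statistics disagree.

```python
# Per-pass figures copied from the pass statistics in the post
# (order: Pass 1, Pass 2, Pass 3). Key names are illustrative labels.
pass_stats = {
    "matched_pairs": [569, 2631, 18],
    "exact_matched_pairs": [430, 0, 0],
    "clerical": [0, 59468, 9],
    "reference_duplicates": [592, 44951, 11],
    "exact_reference_duplicates": [407, 0, 0],
}

# Record counts reported on the job's output links, also from the post.
job_output = {"matched": 89_600, "clerical": 14_200}

# Sum each statistic across the three passes.
totals = {name: sum(values) for name, values in pass_stats.items()}

for name, total in totals.items():
    print(f"{name:28s} {total:>8,d}")
for name, count in job_output.items():
    print(f"job output {name:17s} {count:>8,d}")
```

Laying the totals out this way shows which statistics are far from the job counts, which is a starting point for working out what each pass statistic actually counts.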

ray.wurlod · Post by **ray.wurlod** » Sat Feb 12, 2011 3:30 pm

kennyapril wrote: Please suggest if any changes are required for improving the match rate or count?

The match rate is primarily driven by your data. It may be impossible to "improve" it. If the match specifications are correct, the matches in the data (and only the matches in the data) will be detected.

kennyapril · Post by **kennyapril** » Mon Feb 14, 2011 11:06 am

Earlier there were only 20,000 matched records, but after I changed the blocking columns and the matching commands the number increased to 89,000.

But this count is not equal to the matched records shown in the passes.

Why does this happen?

ray.wurlod · Post by **ray.wurlod** » Mon Feb 14, 2011 3:26 pm

No idea without a lot more information about your choices of blocking fields and match rules!

kennyapril · Post by **kennyapril** » Tue Feb 15, 2011 1:20 pm

The blocking columns in the three passes are:
1. NYSIISfirstname, last_name, zip
2. License_num, License_state, gender
3. last_name, gender, first_name

The matching commands contain the other columns:
firstname, middlename, street_no, first_init, gender, last_name

This is a reference match of the one-to-many multiple type.

Are any changes needed to the blocking columns or matching commands?

stuartjvnorton · Post by **stuartjvnorton** » Tue Feb 15, 2011 4:47 pm

Try making your blocking a little looser.
Remember, blocking is exact, so anything that might have scored ok through inexact matches won't get in because blocking kept it out.
Experiment a bit with your blocking and matching fields. Some choices will be better than others, but give it a go: probabilistic matching is not a cookie cutter exercise.
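
The effect of exact blocking can be sketched with a toy model in Python. This is not how the matching engine is implemented; the function `candidate_pairs` and the sample records are hypothetical. It only shows that a pair reaches the scoring step when every blocking column agrees exactly, so a looser blocking key lets more pairs through.

```python
from collections import defaultdict


def candidate_pairs(source, reference, block_cols):
    """Pair source and reference records whose blocking columns agree exactly."""
    # Group reference records by their blocking key.
    blocks = defaultdict(list)
    for ref in reference:
        blocks[tuple(ref[c] for c in block_cols)].append(ref)
    # A source record is only compared with references in the same block.
    pairs = []
    for src in source:
        key = tuple(src[c] for c in block_cols)
        for ref in blocks.get(key, []):
            pairs.append((src["id"], ref["id"]))
    return pairs


# Hypothetical sample data: one source record and two near-miss references.
source = [{"id": "S1", "last_name": "SMITH", "first_name": "JON", "zip": "10001"}]
reference = [
    {"id": "R1", "last_name": "SMITH", "first_name": "JOHN", "zip": "10001"},
    {"id": "R2", "last_name": "SMITH", "first_name": "JON", "zip": "10002"},
]

tight = candidate_pairs(source, reference, ["last_name", "first_name", "zip"])
loose = candidate_pairs(source, reference, ["last_name"])
print(tight)  # tight blocking: neither near-miss is ever compared
print(loose)  # loose blocking: both references become candidates for scoring
```

With the tight key, a one-letter difference in the first name or a different zip keeps the pair out before any inexact comparison can score it.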

And I'm sure you did this, but I have to ask: did you standardise the data first? It can make a huge difference.

kennyapril · Post by **kennyapril** » Wed Feb 16, 2011 6:27 pm

So I should just experiment with the blocking and matching criteria.

Yes, I standardized the data:
Using USNAME ----> firstname, lastname
Using USAREA ----> city, state, zip
Using USADDR ----> addr1, addr2, addr3

After standardizing, I used Match Frequency to generate the frequency data for the source and the reference.

This produced 3500 source frequency records and 4000 reference frequency records.

I then designed a job with a reference match and a match specification, but when I run the reference match job the source and reference frequency record counts are doubled,
i.e. 7000 and 8000 come as input to the reference match.

Why does this happen?

Is this usual or unusual?

Thanks,