Match specification
-
- Participant
- Posts: 248
- Joined: Fri Jul 30, 2010 9:04 am
I have designed a job that uses a Reference Match stage.
The stage uses a match specification with three passes.
How can I test the passes to find the match hit rate?
Do I need to run the job, or is there another way to run the passes and find the hit rate?
Can anyone help me with this?
Regards,
Kenny
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The match specification designer allows you to place passes into a "holding area" so that particular combinations of match passes may be tested. Testing is done within that designer; you don't need formally to run the job. And you can specify a sample of data if you wish.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 248
- Joined: Fri Jul 30, 2010 9:04 am
Yes, I can test the passes now.
There are three passes in the match specification.
I can see the pass statistics, but how do I judge whether the hit rate is high? At the moment the pass statistics show 85% residuals, 10% matched and 5% clerical.
Does a higher hit rate mean a higher match percentage?
Please clarify.
Regards,
Kenny
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Depends what you mean by "hit". I assume you mean "match". In that case, the match percentage is a measure of hit rate. But some of the clericals may end up being true matches once reviewed (you won't know till that's been done). "Match" in this context is the percentage of records that exist in a set with a master and at least one duplicate.
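As a rough illustration of the point above, this Python sketch turns the counts for one pass into percentages and shows the best case after clerical review. The function name `pass_rates` and the counts are hypothetical, chosen only to reproduce the 85% / 10% / 5% split quoted earlier in the thread; nothing here is QualityStage output or API.

```python
def pass_rates(matched, clerical, residual):
    """Return the percentage of records in each category for one pass."""
    total = matched + clerical + residual
    if total == 0:
        raise ValueError("no records to summarise")
    return {
        "matched": 100.0 * matched / total,
        "clerical": 100.0 * clerical / total,
        "residual": 100.0 * residual / total,
    }


# Hypothetical counts (assumption) giving a 10% / 5% / 85% split.
rates = pass_rates(matched=100, clerical=50, residual=850)

# Upper bound on the match rate: every clerical pair is confirmed
# as a true match on review. The real figure lies between
# rates["matched"] and this value.
best_case_match_rate = rates["matched"] + rates["clerical"]

print(rates)
print(best_case_match_rate)
```

The match percentage is the lower bound on the hit rate; the clerical percentage is how far review could raise it.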
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 248
- Joined: Fri Jul 30, 2010 9:04 am
Yes, by hit rate I mean match rate.
In all three passes the residuals have the highest percentage, followed by clericals and then matches.
So the only way to increase the match rate is to choose different blocking columns and matching commands, together with a match cutoff.
Is that conclusion right?
Also, to check the match rate, do I need to look at the pass statistics, or is there another way?
Regards,
Kenny
-
- Participant
- Posts: 248
- Joined: Fri Jul 30, 2010 9:04 am
The source has 3 million records and the reference has 0.2 million records.
The source and reference, along with their frequency data, were matched using a Reference Match stage with a match specification that has three passes.
The outputs I used from the Reference Match stage are matched and clerical.
After running the job I see 89,600 matched records and 14,200 clerical records, but the pass statistics show different figures:

                                Pass 1    Pass 2    Pass 3
Blocks processed                   384     57373     25966
Matched pairs                      569      2631        18
Exact matched pairs                430         0         0
Clerical                             0     59468         9
Reference duplicates               592     44951        11
Exact reference duplicates         407         0         0
Data residuals                  207267    124382    204633

Why do the counts differ between the job and the statistics?
Please suggest any changes that would improve the match rate or count.
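To make the size of the gap concrete, the per-pass figures quoted above can be totalled and compared with the job's output counts. This is only arithmetic on the numbers in this post. It assumes that the "matched" and "clerical" output links correspond to the "Matched pairs" and "Clerical" rows of the statistics, and it does not explain why the two sets of figures differ.

```python
# Per-pass figures copied from the pass statistics quoted above.
pass_stats = {
    "matched": [569, 2631, 18],
    "clerical": [0, 59468, 9],
}

# Record counts reported by the job run (assumed to map onto the rows above).
job_output = {"matched": 89600, "clerical": 14200}

# Sum each row across the three passes, then compare with the job run.
totals = {name: sum(values) for name, values in pass_stats.items()}
differences = {name: job_output[name] - totals[name] for name in totals}

for name in totals:
    print(
        f"{name}: statistics total {totals[name]}, "
        f"job output {job_output[name]}, difference {differences[name]}"
    )
```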
Regards,
Kenny
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
kennyapril wrote: Please suggest if any changes are required for improving the match rate or count ??
The match rate is primarily driven by your data. It may be impossible to "improve" it. If the match specifications are correct, the matches in the data (and only the matches in the data) will be detected.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 248
- Joined: Fri Jul 30, 2010 9:04 am
The blocking columns in the three passes are:
1. NYSIISfirstname, last_name, zip
2. License_num, License_state, gender
3. last_name, gender, first_name
The matching commands use the other columns:
firstname, middlename, street_no, first_init, gender, last_name
This is a reference match and it has the one-to-many multiple type.
Are any changes needed to the blocking columns or matching commands?
Regards,
Kenny
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
Try making your blocking a little looser.
Remember, blocking is exact, so anything that might have scored ok through inexact matches won't get in because blocking kept it out.
Experiment a bit with your blocking and matching fields. Some choices will be better than others, but give it a go: probabilistic matching is not a cookie cutter exercise.
And I'm sure you did this, but I have to ask: did you standardise the data first? It can make a huge difference.
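A minimal Python sketch of why exact blocking can hide good candidates. The records, column names and the `candidate_pairs` helper are hypothetical, and `difflib.SequenceMatcher` is only a crude stand-in for an inexact comparison; it is not how QualityStage scores pairs.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical records: same zip, last names differ by one letter.
source = [{"id": "S1", "last_name": "JOHNSON", "zip": "10001"}]
reference = [{"id": "R1", "last_name": "JOHNSTON", "zip": "10001"}]


def candidate_pairs(source, reference, block_cols):
    """Pair up only those records whose blocking columns agree exactly."""
    blocks = defaultdict(list)
    for ref in reference:
        blocks[tuple(ref[c] for c in block_cols)].append(ref)
    pairs = []
    for src in source:
        key = tuple(src[c] for c in block_cols)
        for ref in blocks.get(key, []):
            pairs.append((src["id"], ref["id"]))
    return pairs


def similarity(a, b):
    """Stand-in for an inexact comparison (assumption, not QualityStage)."""
    return SequenceMatcher(None, a, b).ratio()


# Tight blocking: the one-letter difference keeps the pair out entirely.
tight = candidate_pairs(source, reference, ["last_name", "zip"])
# Looser blocking: the pair becomes a candidate and can then be scored.
loose = candidate_pairs(source, reference, ["zip"])

print(tight)
print(loose)
print(similarity("JOHNSON", "JOHNSTON"))
```

With the tight key the pair never reaches the matching commands, however similar the names are; with the looser key it is compared and can score well. The trade-off is larger blocks and more comparisons per pass.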
-
- Participant
- Posts: 248
- Joined: Fri Jul 30, 2010 9:04 am
So I should just experiment with the blocking and matching criteria.
Yes, I standardized the data:
Using USNAME ----> firstname, lastname
Using USAREA ----> city, state, zip
Using USADDR ----> addr1, addr2, addr3
After standardizing, I used Match Frequency to generate the frequency data for the source and reference.
The source frequency data had 3500 records and the reference frequency data had 4000.
I then designed a job with a Reference Match stage and a match specification, and when I run that job the source and reference frequency record counts are doubled,
i.e. 7000 and 8000 records arrive as input to the Reference Match stage.
Why does this happen?
Is this usual or unusual?
Thanks,
Regards,
Kenny