Match Specification - NO MATCHES FOUND
-
- Participant
- Posts: 55
- Joined: Tue Sep 20, 2005 10:58 am
Hi All
I am trying to understand how match specifications work.
My sample data file has 3 rows
John Smith,83713
Johnathan Smith, 83713
Jon Smith, 83713
My objective is to have one of them survive, as they are duplicates with different first names.
I ran standardization on these using the USNAME and USAREA rule sets for name and ZIP code. I did not use USPREP, as this is a very simple example without much free-form addressing.
I built my match specification as follows:
Blocking done by
Zipcode field. ( My understanding is to find matches within the same zipcode)
Match Criteria
FirstName_USNAME type name_uncert
PrimaryName_USNAME type uncert
Zip type CNT_DIFF (Not sure I understood this very well but following what was done in the IBM class)
I left the default u and m probabilities.
When I ran Test All Passes, it said NO MATCHES FOUND.
I was expecting it to find at least one match.
I want to make a very simple example work before moving on to more difficult ones.
Any ideas where I am going wrong?
Thanks for all your help
Arvind
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
They look like they should generate a set. What are your cutoff settings? What are the weights generated by your data - both the aggregate weights and the individual field weights? Finally, what parameter did you provide for your name_uncert matches?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 55
- Joined: Tue Sep 20, 2005 10:58 am
Sorry for my delayed response. I could not find time to work on my simple sample because we had other production issues.
I have since experimented a little with the data values, and I still have questions.
Initially I had my cutoff values set at 25 for both clerical and match, and it said no matches were found.
It reported no pseudo matches, no clerical pairs, and 3 data residuals.
Then I went to the other extreme and set the cutoff values to 0 for both clerical and match.
It now reported 1 pseudo match and two data residuals.
For John Smith it had a weight of 1.45 and a record type of XA. (I am not sure what record type XA means.)
For Johnathan Smith it had a weight of 1.45 as well, but a record type of DA.
What I am not sure about is why Jon Smith is left out.
Does QualityStage only match in pairs?
Why would Jon Smith be a data residual?
I know I am asking fundamental questions, but that is exactly why I am using a simple example to understand how the match logic works.
Any help is greatly appreciated.
Thanks
Arvind
-
- Participant
- Posts: 55
- Joined: Tue Sep 20, 2005 10:58 am
Hi Ray
Unduplicate_Match_2,0: Variable: PrimaryName_USNA
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
SMITH 3 0 0 1.00 1.00 D 0.01 0.00
Variable: Zip
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
83713 3 0 0 1.00 1.00 D 0.01 0.00
Variable: FirstName_USNAME
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
JOHN 1 0 0 0.90 0.33 D 1.43 -2.73
JOHNATHAN 1 0 0 0.90 0.33 D 1.43 -2.73
JON 1 0 0 0.90 0.33 D 1.43 -2.73
Composite weights
Unduplicate_Match_2,0: Default weights calculated for values OUTSIDE table
Columns are: A Var Name, User mprob, Revised mprob, User uprob, Revised uprob, Agree weight, Disagree weight, Missing weight
PrimaryName_USNA 0.90 0.9999 0.01 0.9999 0.00 0.00 0.00
Zip 0.90 0.9999 0.01 0.9999 0.00 0.00 0.00
FirstName_USNAME 0.90 0.9000 0.01 0.3333 1.43 -2.73 0.00
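The FirstName_USNAME figures in this report are consistent with the usual base-2 log weights for probabilistic matching: agreement weight = log2(m/u) and disagreement weight = log2((1-m)/(1-u)). The sketch below is only a check of that arithmetic. It assumes the revised u probability is the value frequency divided by the record count (1/3 here), which is an inference from the log above and not a statement about QualityStage internals; the function names are made up for the example.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight added when two values agree: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added (usually negative) when two values disagree."""
    return math.log2((1.0 - m) / (1.0 - u))


# Figures taken from the log: m = 0.90, each first name occurs once in 3 records.
m_prob = 0.90
record_count = 3
value_frequency = 1
u_prob = value_frequency / record_count  # assumption: revised u = frequency / count

first_name_agree = agreement_weight(m_prob, u_prob)
first_name_disagree = disagreement_weight(m_prob, u_prob)

# SMITH and 83713 occur in every record, so revised m and u are both 0.9999
# and agreement on them carries essentially no information.
surname_agree = agreement_weight(0.9999, 0.9999)

print(f"{first_name_agree:.2f} {first_name_disagree:.2f} {surname_agree:.2f}")
```

Under these assumptions the first-name weights come out at roughly 1.43 and -2.74 (the log shows -2.73, possibly truncated), and the surname and ZIP weights at about zero, which is why only the first name contributes to the score in this sample.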
-
- Participant
- Posts: 55
- Joined: Tue Sep 20, 2005 10:58 am
My basic question, in my quest to understand the QualityStage functionality, is this:
How can I make these three records match?
Jon Smith seems to drop out when I run my unduplicate job with the match specification cutoffs set to 0 for both clerical and match.
Also, what does it mean when a record is called a data residual? Is it a record that does not match?
Arvind
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The issue is that John Smith is a very common name, so that the probability of finding a match purely by chance (the "U probability") is quite high. This is reflected in the low aggregate weight; indeed this might take the aggregate weight to a negative value (a "disagreement weight"). Since your data records contain no other fields that could contribute to the confidence in a match (such as date of birth, address, etc.), you will only see a match with the match cutoff set very low (as you have seen).
Add more fields and/or deliberately reduce the u-prob figure and/or apply weight overrides to increase the likelihood of match with this far too small sample.
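To make the cutoff point concrete, here is a minimal sketch using the agreement weights from the report posted earlier in the thread. It assumes the composite weight is the sum of the per-field weights and that a pair is a match at or above the match cutoff, clerical at or above the clerical cutoff, and a residual otherwise. The function names and the three labels are illustrative, not QualityStage identifiers.

```python
# Agreement weights as printed in the weight report for this sample.
FIELD_AGREE_WEIGHTS = {
    "FirstName_USNAME": 1.43,
    "PrimaryName_USNA": 0.01,
    "Zip": 0.01,
}


def composite_weight(field_weights: dict) -> float:
    """Assumed aggregation: add up the weight contributed by each field."""
    return sum(field_weights.values())


def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Assumed cutoff logic: match, then clerical, otherwise residual."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "residual"


# Best case for this data: every field agrees.
best_case = composite_weight(FIELD_AGREE_WEIGHTS)

result_at_25 = classify(best_case, match_cutoff=25, clerical_cutoff=25)
result_at_0 = classify(best_case, match_cutoff=0, clerical_cutoff=0)

print(round(best_case, 2), result_at_25, result_at_0)
```

The best-case sum is 1.45, the same weight reported for John Smith and Johnathan Smith, so under these assumptions a cutoff of 25 can never be reached with this data, while a cutoff of 0 lets the pair through.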
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hi, I have a small question about the weight report aramachandra posted above. Where do you find the detailed information on what weight has been assigned to each field? Is there any file that gets generated from which I can find the default weights calculated by QualityStage?
Thanks,
g.kiran
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
(In version 8) you can simply right click when testing the match specification and look at the individual field weights.
I'd have to research how to do it in version 7.5 and earlier, but recall that it can be generated in one of the reports.
Last edited by ray.wurlod on Mon Aug 11, 2008 11:32 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod wrote: (In version 8) you can simply right click when testing the match specification and look at the individual field weights.
I'd have to research how to do it in version 7.5 and earlier, but recall that it can be generated in one of the reports.
Hi Ray,
Unfortunately we are still working on version 7.0, so it would be a great help if you could help me find the individual weight information.
Also, I am specifically trying to analyze how the fields are assigned weights.
In my example, I am comparing two files, A and B, on LNUSNAM and FNUSNAM.
Both files contain the same first and last name record (a 1-to-1 match).
When I apply an m probability of 0.999 and a u probability of 0.001 to both LNUSNAM and FNUSNAM, the composite weight is 30.56, and when I change the m and u probabilities to 0.9 and 0.1 respectively, the composite weight is 30.26. The FNUSNAM value (e.g. RHONDA) appears 13 times in the input file.
I was trying to see whether the composite weight is calculated directly as the sum of the individual agreement weights, but it is not.
Is there a relationship that can be established between the agreement weight, the frequency of occurrence of the value, and the composite weight?
Such a relationship would help in understanding the weight concept and in specifying the match and clerical cutoffs.
Thanks.
g.kiran
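As a rough check on the figures in this post, the sketch below computes what the composite would be if it were simply the sum of log2(m/u) over the two fields, using only the probabilities entered by the user. The formula is the standard one for probabilistic matching and the helper name is made up for this example; it is not taken from the QualityStage 7.0 reports.

```python
import math


def summed_agreement(m: float, u: float, fields: int = 2) -> float:
    """Sum of log2(m / u) over `fields` fields that all use the same m and u."""
    return fields * math.log2(m / u)


# The two settings described above, applied to both LNUSNAM and FNUSNAM.
strict = summed_agreement(0.999, 0.001)
loose = summed_agreement(0.9, 0.1)

print(f"{strict:.2f} {loose:.2f}")
```

The two sums are roughly 19.93 and 6.34, neither of which is close to the reported 30.56 and 30.26. The reported composite also barely moves when the entered probabilities change. One possible reading, not confirmed in this thread, is that the u probabilities actually used are revised from value frequencies, as the "Revised uprob" column in the earlier weight report suggests.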
-
- Participant
- Posts: 55
- Joined: Tue Sep 20, 2005 10:58 am
Sorry for the delayed response ...again
Based on my last email
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
JOHN 1 0 0 0.90 0.33 D 1.43 -2.73
JOHNATHAN 1 0 0 0.90 0.33 D 1.43 -2.73
JON 1 0 0 0.90 0.33 D 1.43 -2.73
I think I understand that QualityStage is telling me that John and Johnathan are close matches, so I need to have one of them survive as my master record.
But Jon is treated as a residual.
At least that is what it does for cutoffs of 0.
Looking at the weights above, the agree and disagree weights are identical for all three names, yet only "Jon" is considered a residual.
Is there any reason why QualityStage considered Jon a residual, rather than telling me that all three records match and that only one of the three should survive, instead of one of the two it is giving me currently?
Sorry for this basic question, but I cannot figure out why QualityStage would consider Jon a residual as opposed to John and Johnathan.
Arvind
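One possible reading of this result, sketched under stated assumptions: if the name_uncert comparison first truncates the longer value to the length of the shorter one and checks for an exact match, then JOHN and JOHNATHAN agree outright, while JON falls through to a fuzzy comparison whose outcome depends on the threshold parameter asked about earlier in the thread. The truncation step is my recollection of the documented behaviour and should be verified. Python's difflib is used purely as a stand-in scorer and is not the UNCERT algorithm.

```python
from difflib import SequenceMatcher


def prefix_agree(a: str, b: str) -> bool:
    """Assumed first step: cut the longer name to the shorter's length, compare exactly."""
    a, b = a.strip().upper(), b.strip().upper()
    n = min(len(a), len(b))
    return n > 0 and a[:n] == b[:n]


def fuzzy_score(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT the real UNCERT comparator."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


john_vs_johnathan = prefix_agree("John", "Johnathan")  # JOHN == JOHN
john_vs_jon = prefix_agree("John", "Jon")              # JOH != JON
jon_score = fuzzy_score("John", "Jon")                 # decided by the threshold

print(john_vs_johnathan, john_vs_jon, round(jon_score, 3))
```

In this sketch John and Johnathan agree on the prefix test alone, whereas John and Jon do not and score about 0.857 on the stand-in comparator. Whether that counts as agreement depends entirely on the threshold, so the actual name_uncert parameter in the match specification is the thing to check.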
Re: Match Specification - NO MATCHES FOUND
Hi,
If your business accepts that John, Jon and Johnathan are the same, then give all three first names an identical output value in the classification table.
For example:
john john F
jon john F
johnathan john F
Now standardize and match.
This brings all three names into one match set.
Regards
Vairamuthu
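The effect of such classification entries can be illustrated outside QualityStage with a small lookup. This is only a sketch of the idea: the dictionary mirrors the three entries above, and the function and variable names are hypothetical.

```python
# Token -> (standard value, class), mirroring the three entries above.
CLASSIFICATION = {
    "JOHN": ("JOHN", "F"),
    "JON": ("JOHN", "F"),
    "JOHNATHAN": ("JOHN", "F"),
}


def standardize_first_name(raw: str) -> str:
    """Return the standard form of a first name, or the uppercased input if unlisted."""
    token = raw.strip().upper()
    standard, _token_class = CLASSIFICATION.get(token, (token, "?"))
    return standard


rows = ["John Smith,83713", "Johnathan Smith, 83713", "Jon Smith, 83713"]

match_keys = set()
for row in rows:
    name, zip_code = row.split(",")
    first, last = name.split()
    match_keys.add((standardize_first_name(first), last.upper(), zip_code.strip()))

print(match_keys)
```

After standardization all three rows share the key (JOHN, SMITH, 83713), so an exact comparison on the standardized first name would place them in the same set.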
vairamuthu