Match Specification - NO MATCHES FOUND

Infosphere's Quality Product

Moderators: chulett, rschirm

aramachandra
Participant
Posts: 55
Joined: Tue Sep 20, 2005 10:58 am

Post by aramachandra »

Hi All

I am trying to understand how the match specifications work.

My sample data file has 3 rows

John Smith,83713
Johnathan Smith, 83713
Jon Smith, 83713

My objective is to make one of them survive, as they are duplicates with different first names.

I ran standardization on these using the USNAME and USAREA rule sets for name and zip code. I did not use USPREP, as this is a very simple example without much free-form addressing.


I built my match specification as follows:

Blocking done by
Zipcode field. ( My understanding is to find matches within the same zipcode)
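
As a rough illustration of what blocking on the zip code means conceptually, the sketch below groups the three sample rows by zip and lists the candidate pairs inside each block. This is only a sketch of the idea and makes no claim about how QualityStage implements blocking internally; the function name `candidate_pairs` is hypothetical, and trimming the stray space before the zip is an assumption about what standardization would do.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the post: (name, zip). Two of the zips carry a leading space.
records = [
    ("John Smith", "83713"),
    ("Johnathan Smith", " 83713"),
    ("Jon Smith", " 83713"),
]


def candidate_pairs(rows):
    """Group rows by trimmed zip code and return every pair inside each block."""
    blocks = defaultdict(list)
    for name, zip_code in rows:
        blocks[zip_code.strip()].append(name)  # assumed: whitespace is trimmed
    pairs = []
    for zip_code, names in blocks.items():
        for left, right in combinations(names, 2):
            pairs.append((zip_code, left, right))
    return pairs


pairs = candidate_pairs(records)
for zip_code, left, right in pairs:
    print(zip_code, left, "<->", right)
```

With all three rows in the same block, every pair of records is a candidate for comparison, so blocking alone would not explain a record being left out.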


Match Criteria

FirstName_USNAME type name_uncert
PrimaryName_USNAME type uncert
Zip type CNT_DIFF (Not sure I understood this very well but following what was done in the IBM class)

I leave the default u and m probabilities.

When I ran the test for all passes, it said NO MATCHES FOUND.

I was expecting it to find at least one match.

I want to make a very simple example work before moving on to more difficult ones.

Any ideas where I am going wrong here?

Thanks for all your help

Arvind
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

They look like they should generate a set. What are your cutoff settings? What are the weights generated by your data - both the aggregate weights and the individual field weights? Finally, what parameter did you provide for your name_uncert matches?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
aramachandra
Participant
Posts: 55
Joined: Tue Sep 20, 2005 10:58 am

Post by aramachandra »

Sorry for my delayed response. I could not find time to work on my simple sample, as we had other production issues. Anyway,

I have since played a little with the data values, and I still have questions.

Initially I had my cutoff values set at 25 for both clerical and match, and it said no matches found.

Basically it had no pseudo matches, no clerical pairs, and 3 data residuals.

Then I went to the other extreme and set the cutoff values to 0 for both clerical and match.

It now said it had 1 pseudo match and two data residuals.

For John Smith it had a weight of 1.45 and record type of XA. ( Not sure what record type of XA means )

For Johnathan Smith it had a weight of 1.45 as well, but a record type of DA.
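
The effect of the two cutoff settings described above can be sketched as a simple threshold test on the composite weight. This is a minimal sketch, not QualityStage's actual logic: the function name `classify_pair` is hypothetical, and treating a weight exactly equal to a cutoff as passing it is an assumption.

```python
def classify_pair(weight, clerical_cutoff, match_cutoff):
    """Classify a candidate pair by comparing its composite weight to the cutoffs."""
    if weight >= match_cutoff:      # assumed: equality counts as a match
        return "match"
    if weight >= clerical_cutoff:   # between the two cutoffs: needs manual review
        return "clerical"
    return "nonmatch"


pair_weight = 1.45  # composite weight reported for the John / Johnathan pair

high_cutoffs = classify_pair(pair_weight, clerical_cutoff=25, match_cutoff=25)
zero_cutoffs = classify_pair(pair_weight, clerical_cutoff=0, match_cutoff=0)
print(high_cutoffs, zero_cutoffs)
```

Under these assumptions a weight of 1.45 falls below cutoffs of 25 but clears cutoffs of 0, which is consistent with the two runs reported here.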

What I am not sure about is why Jon Smith is left out.

Does QualityStage only match in pairs?

Why would Jon Smith be a data residual?

I know I am asking fundamental questions but that is exactly why I am trying to use a simple example to understand how the match logic works

Any help is greatly appreciated

Thanks
Arvind
aramachandra
Participant
Posts: 55
Joined: Tue Sep 20, 2005 10:58 am

Post by aramachandra »

Hi Ray

Unduplicate_Match_2,0: Variable: PrimaryName_USNA
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
SMITH 3 0 0 1.00 1.00 D 0.01 0.00
Variable: Zip
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
83713 3 0 0 1.00 1.00 D 0.01 0.00
Variable: FirstName_USNAME
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
JOHN 1 0 0 0.90 0.33 D 1.43 -2.73
JOHNATHAN 1 0 0 0.90 0.33 D 1.43 -2.73
JON 1 0 0 0.90 0.33 D 1.43 -2.73





Composite weights

Unduplicate_Match_2,0: Default weights calculated for values OUTSIDE table
Columns are: A Var Name, User mprob, Revised mprob, User uprob, Revised uprob, Agree weight, Disagree weight, Missing weight
PrimaryName_USNA 0.90 0.9999 0.01 0.9999 0.00 0.00 0.00
Zip 0.90 0.9999 0.01 0.9999 0.00 0.00 0.00
FirstName_USNAME 0.90 0.9000 0.01 0.3333 1.43 -2.73 0.00
aramachandra
Participant
Posts: 55
Joined: Tue Sep 20, 2005 10:58 am

Post by aramachandra »

My basic question, in my quest to understand the QualityStage functionality, is:

How can I make these three records match?

Jon Smith seems to fall out of the loop when I run my unduplicate job with the match specification cutoffs set to 0 for clerical and match.

Also, what does it mean when one says a record is a data residual? Is that a record that does not match?


Arvind
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The issue is that John Smith is a very common name, so that the probability of finding a match purely by chance (the "U probability") is quite high. This is reflected in the low aggregate weight; indeed this might take the aggregate weight to a negative value (a "disagreement weight"). Since your data records contain no other fields that could contribute to the confidence in a match (such as date of birth, address, etc.), you will only see a match with the match cutoff set very low (as you have seen).
Add more fields and/or deliberately reduce the u-prob figure and/or apply weight overrides to increase the likelihood of match with this far too small sample.
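
The numbers in the weight report posted earlier in the thread are consistent with the usual log-odds form of field weights, where the agreement weight is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)). The sketch below applies that formula to the FirstName_USNAME values from the report (m = 0.90, revised u = 0.3333). That QualityStage uses exactly this formula is an assumption here, chosen because it reproduces the logged values; the function names are hypothetical.

```python
import math


def agreement_weight(m, u):
    """Log-odds weight added when a field agrees (assumed formula)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Log-odds weight added when a field disagrees (assumed formula)."""
    return math.log2((1 - m) / (1 - u))


# FirstName_USNAME from the report: m = 0.90, revised u = 0.3333 (three values, each 1 of 3)
m, u = 0.90, 1 / 3
agree = agreement_weight(m, u)
disagree = disagreement_weight(m, u)
print(f"agree={agree:.3f} disagree={disagree:.3f}")

# Summing the agree weights shown in the report for the three fields
reported_agree_weights = {"PrimaryName_USNA": 0.01, "Zip": 0.01, "FirstName_USNAME": 1.43}
composite = sum(reported_agree_weights.values())
print(f"composite={composite:.2f}")
```

The computed values land within rounding distance of the 1.43 and -2.73 in the report, and the three reported agree weights sum to the 1.45 shown for the matched pair. Because SMITH and 83713 occur in every record, their u is close to 1 and they contribute almost nothing, which is why lowering u or adding more discriminating fields raises the composite weight.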
DSkkk
Charter Member
Posts: 70
Joined: Fri Nov 05, 2004 1:10 pm

Post by DSkkk »

Hi, I have a small question. Where do you find the detailed information on what weight has been assigned to each field? Is there any file that gets generated where I can find the default weights calculated by QualityStage?

Thanks,
g.kiran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

(In version 8) you can simply right click when testing the match specification and look at the individual field weights.

I'd have to research how to do it in version 7.5 and earlier, but recall that it can be generated in one of the reports.
Last edited by ray.wurlod on Mon Aug 11, 2008 11:32 pm, edited 1 time in total.
DSkkk
Charter Member
Posts: 70
Joined: Fri Nov 05, 2004 1:10 pm

Post by DSkkk »

Hi Ray,

Unfortunately we are still working on version 7.0, so it would be a great help if you could help me find the individual weight information.
Also, I am specifically trying to analyze how the fields are assigned weights.

In my example, I am comparing two files, say A and B, on LNUSNAM and FNUSNAM.

Both files have the same first and last name record (a 1-to-1 match).

When I apply an m-prob of .999 and a u-prob of .001 to both LNUSNAM and FNUSNAM, the composite weight is "30.56", and when I change the m and u probabilities to 0.9 and 0.1 respectively, the composite weight is "30.26". The text in FNUSNAM (e.g. RHONDA) appears 13 times in the input file.

What I was trying to do is see whether the composite weight is calculated directly from the sum of the individual agreement weights. But it is not.

Is there a relation that can be established between the agreement weight, the frequency of occurrence of the text, and the composite weight?

This relation would be of great help in understanding the weight concept and in specifying the match and clerical cutoffs.
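
For comparison, here is what the plain log2(m/u) agreement weight would give for two agreeing fields if the user-supplied probabilities were used unchanged. This is a sketch under that assumption only; the helper name `two_field_composite` is hypothetical and nothing here is taken from QualityStage 7.0 documentation.

```python
import math


def two_field_composite(m, u):
    """Sum of log2(m/u) over two agreeing fields that share the same m and u."""
    return 2 * math.log2(m / u)


strict = two_field_composite(0.999, 0.001)  # first setting tried
loose = two_field_composite(0.9, 0.1)       # second setting tried
print(f"strict={strict:.2f} loose={loose:.2f}")
```

Neither result is close to the reported 30.56 and 30.26, and the reported figures barely move between the two settings. That pattern would be expected if the supplied u were being replaced by a frequency-based estimate, as the "Revised uprob" column in the weight report earlier in this thread shows for a different job, but that reading is a guess rather than something this thread confirms.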

Thanks.
g.kiran
aramachandra
Participant
Posts: 55
Joined: Tue Sep 20, 2005 10:58 am

Post by aramachandra »

Sorry for the delayed response ...again

Based on my last email

ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
JOHN 1 0 0 0.90 0.33 D 1.43 -2.73
JOHNATHAN 1 0 0 0.90 0.33 D 1.43 -2.73
JON 1 0 0 0.90 0.33 D 1.43 -2.73


I think I get it: QualityStage is telling me that John and Johnathan are close matches, and hence I need to survive one of them as my master record.

But Jon is treated as a residual.

At least that is what it does for cutoffs of 0.

Looking at the weights above, the agree and disagree weights are identical, but still only "Jon" is considered a residual.

Is there any reason why QualityStage considered Jon a residual, rather than telling me that all three of them match and that I should survive only one of the three, instead of one of the two it is giving me currently?

Sorry for this basic question, but I cannot seem to figure out why QualityStage would consider Jon to be a residual as against John and Johnathan.


Arvind
vairus
Participant
Posts: 52
Joined: Thu Feb 07, 2008 8:02 am
Location: Johannesburg

Re: Match Specification - NO MATCHES FOUND

Post by vairus »

Hi,

If your business accepts that john, jon, and johnathan are the same, then give all three first names an identical output value in the classification table.

For example:

john john F
jon john F
johnathan john F

Now standardize and match.

You can then bring all three names into one match set.
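
The effect of those three classification entries can be sketched as a lookup applied before matching. This is only an illustration of the idea: the dictionary and function names are hypothetical, and upper-casing the token is an assumption based on the upper-case values in the weight report.

```python
# Hypothetical lookup mirroring the three classification entries above.
FIRST_NAME_CLASSES = {
    "JOHN": "JOHN",
    "JON": "JOHN",
    "JOHNATHAN": "JOHN",
}


def standard_first_name(raw):
    """Return the shared standard form for a first name, or the cleaned token if unlisted."""
    token = raw.strip().upper()
    return FIRST_NAME_CLASSES.get(token, token)


names = ["John", "Johnathan", "Jon"]
standardized = {standard_first_name(name) for name in names}
print(standardized)
```

Once all three first names standardize to the same value, the first-name field agrees exactly for every pair instead of relying on an uncertainty comparison.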

Regards
Vairamuthu
vairamuthu