Match Specification - NO MATCHES FOUND

aramachandra · Post by **aramachandra** » Mon Aug 04, 2008 2:33 pm

Hi All

I am trying to understand how match specifications work.

My sample data file has 3 rows

John Smith,83713
Johnathan Smith, 83713
Jon Smith, 83713

My objective is to make one of them survive, as they are duplicates with different first names.

I ran standardization on these using the USNAME and USAREA rule sets for Name and Zipcode. I did not use USPREP, as this is a very simple example without much free-form addressing.

I built my match specification as follows:

Blocking done by
Zipcode field. ( My understanding is to find matches within the same zipcode)

Match Criteria

FirstName_USNAME type name_uncert
PrimaryName_USNAME type uncert
Zip type CNT_DIFF (Not sure I understood this very well but following what was done in the IBM class)

I left the default u and m probabilities.

When I ran Test All Passes, it said NO MATCHES FOUND.

I was expecting it to at least find one match.

I want to make a very simple example work before moving on to more difficult ones.

Any ideas where I am going wrong here?

Thanks for all your help

Arvind

ray.wurlod · Post by **ray.wurlod** » Mon Aug 04, 2008 3:03 pm

They look like they should generate a set. What are your cutoff settings? What are the weights generated by your data - both the aggregate weights and the individual field weights? Finally, what parameter did you provide for your name_uncert matches?

aramachandra · Post by **aramachandra** » Wed Aug 06, 2008 11:49 am

Sorry for my delayed response. I could not find time to work on my simple sample, as we had other production issues.

I have since played a little with the data values and I still have questions.

Initially I had my cutoff values set at 25 for both clerical and match, and it said no matches were found.

It had no pseudo matches, no clerical pairs, and 3 data residuals.

Then I went to the other extreme and set the cutoff values to 0 for both clerical and match.

It now said it had 1 pseudo match and two data residuals.

For John Smith it had a weight of 1.45 and a record type of XA. (I am not sure what a record type of XA means.)

For Johnathan Smith it also had a weight of 1.45, but a record type of DA.

What I am not sure about is why Jon Smith was left out.

Does QualityStage only match in pairs?

Why would Jon Smith be a data residual?

I know I am asking fundamental questions but that is exactly why I am trying to use a simple example to understand how the match logic works

Any help is greatly appreciated

Thanks
Arvind
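
The two runs described above are consistent with how cutoffs partition composite weights. A minimal sketch, assuming a pair at or above the match cutoff is a match, a pair at or above the clerical cutoff but below the match cutoff goes to clerical review, and anything lower is left as a residual. The function name and labels are illustrative and are not part of QualityStage.

```python
def classify_pair(weight, match_cutoff, clerical_cutoff):
    """Classify a candidate pair by its composite weight (illustrative only)."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "residual"


# Composite weight of 1.45, as reported for the John / Johnathan pair
print(classify_pair(1.45, match_cutoff=25, clerical_cutoff=25))  # residual
print(classify_pair(1.45, match_cutoff=0, clerical_cutoff=0))    # match
```

With both cutoffs at 25, a weight of 1.45 can never qualify, which fits the "no matches found" result of the first run.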

aramachandra · Post by **aramachandra** » Wed Aug 06, 2008 1:19 pm

Hi Ray

Unduplicate_Match_2,0: Variable: PrimaryName_USNA
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
SMITH 3 0 0 1.00 1.00 D 0.01 0.00
Variable: Zip
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
83713 3 0 0 1.00 1.00 D 0.01 0.00
Variable: FirstName_USNAME
New MISSING weight for ALL values: 0.00
ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
JOHN 1 0 0 0.90 0.33 D 1.43 -2.73
JOHNATHAN 1 0 0 0.90 0.33 D 1.43 -2.73
JON 1 0 0 0.90 0.33 D 1.43 -2.73

Composite weights

Unduplicate_Match_2,0: Default weights calculated for values OUTSIDE table
Columns are: A Var Name, User mprob, Revised mprob, User uprob, Revised uprob, Agree weight, Disagree weight, Missing weight
PrimaryName_USNA 0.90 0.9999 0.01 0.9999 0.00 0.00 0.00
Zip 0.90 0.9999 0.01 0.9999 0.00 0.00 0.00
FirstName_USNAME 0.90 0.9000 0.01 0.3333 1.43 -2.73 0.00

aramachandra · Post by **aramachandra** » Wed Aug 06, 2008 1:24 pm

My basic question, in my quest to understand QualityStage functionality, is:

How can I make these three records match?

Jon Smith seems to fall out of the loop when I run my unduplicate job with the match specification cutoffs set to 0 for clerical and match.

Also, what does it mean when a record is called a data residual? Is that a record that does not match?

Arvind

ray.wurlod · Post by **ray.wurlod** » Wed Aug 06, 2008 3:04 pm

The issue is that John Smith is a very common name, so that the probability of finding a match purely by chance (the "U probability") is quite high. This is reflected in the low aggregate weight; indeed this might take the aggregate weight to a negative value (a "disagreement weight"). Since your data records contain no other fields that could contribute to the confidence in a match (such as date of birth, address, etc.), you will only see a match with the match cutoff set very low (as you have seen).
Add more fields and/or deliberately reduce the u-prob figure and/or apply weight overrides to increase the likelihood of match with this far too small sample.
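
Ray's point about the u-probability can be made concrete with the standard probabilistic-linkage weight formulas, agreement = log2(m/u) and disagreement = log2((1-m)/(1-u)). This is a minimal sketch, assuming these formulas are what produced the log posted above; with m = 0.90 and the revised u = 0.3333 they come out close to the 1.43 and -2.73 shown for FirstName_USNAME. The function names are illustrative.

```python
import math


def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added (normally negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


# FirstName_USNAME in the posted log: m = 0.90, revised u = 0.3333
print(round(agreement_weight(0.90, 0.3333), 2))  # 1.43
print(disagreement_weight(0.90, 0.3333))         # close to the -2.73 in the log

# When u is as high as m (a value shared by every record), agreement adds nothing
print(agreement_weight(0.9999, 0.9999))  # 0.0

# Lowering u, as Ray suggests, raises the agreement weight
print(agreement_weight(0.90, 0.01) > agreement_weight(0.90, 0.3333))  # True
```

This is why SMITH and 83713, which appear in every record, contribute almost nothing to the composite weight in this sample.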

DSkkk · Post by **DSkkk** » Mon Aug 11, 2008 8:23 pm

Hi, I have a small question. Where do you find the detailed information on what weight has been assigned to each field? Is there any file generated where I can find the default weights calculated by QualityStage?

Thanks,

ray.wurlod · Post by **ray.wurlod** » Mon Aug 11, 2008 8:27 pm

(In version 8) you can simply right click when testing the match specification and look at the individual field weights.

I'd have to research how to do it in version 7.5 and earlier, but recall that it can be generated in one of the reports.

DSkkk · Post by **DSkkk** » Mon Aug 11, 2008 8:52 pm

Hi Ray,

Unfortunately we are still working on version 7.0, so it would be a great help if you could help me find the individual weight information.
Also, I am specifically trying to analyze how the fields are assigned weights.

In my example, I am comparing two files, say A and B, on LNUSNAM and FNUSNAM.

Both files have a record with the same first and last name (a 1-to-1 match).

When I apply an m-prob of 0.999 and a u-prob of 0.001 to both LNUSNAM and FNUSNAM, the composite weight is 30.56, and when I change the m and u probabilities to 0.9 and 0.1 respectively, the composite weight is 30.26. The text in FNUSNAM (e.g. RHONDA) appears 13 times in the input file.

I was trying to see whether the composite weight is calculated directly as the sum of the individual agreement weights, but it is not.

Is there a relationship between the agreement weight, the frequency of occurrence of the text, and the composite weight?

Knowing this relationship would help a great deal in understanding the weight concept and in specifying the match and clerical cutoffs.

Thanks.
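
The check described in this post can be sketched as follows, assuming each field's agreement weight is log2(m/u) computed from the user-specified probabilities and that the composite is their plain sum. The helper name is hypothetical. The naive sums come out at roughly 19.93 and 6.34, neither of which equals the reported 30.56 or 30.26. The log posted earlier in the thread shows a "Revised uprob" column that appears to follow value frequency, so a frequency adjustment is one plausible reason for the gap, but nothing in the thread confirms it.

```python
import math


def naive_composite(m, u, n_fields):
    """Sum of identical per-field agreement weights, using user-specified m and u."""
    return n_fields * math.log2(m / u)


# Two match fields (LNUSNAM and FNUSNAM), with the composites reported in the post
for m, u, reported in [(0.999, 0.001, 30.56), (0.9, 0.1, 30.26)]:
    naive = naive_composite(m, u, n_fields=2)
    print(f"m={m}, u={u}: naive sum {naive:.2f}, reported composite {reported}")
```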

aramachandra · Post by **aramachandra** » Wed Aug 13, 2008 12:47 pm

Sorry for the delayed response... again.

Based on my last post:

ALL VALUES INSIDE TABLE
Columns are: Value, Frequency, MAgree, MPart, mprob, uprob, type, Agree weight, Disagree weight
JOHN 1 0 0 0.90 0.33 D 1.43 -2.73
JOHNATHAN 1 0 0 0.90 0.33 D 1.43 -2.73
JON 1 0 0 0.90 0.33 D 1.43 -2.73

I think I get that QualityStage is telling me that John and Johnathan are close matches, and hence I need to survive one of them as my master record.

But Jon is treated as a residual.

At least that is what it does for cutoffs of 0.

Looking at the weights above, the agree and disagree weights are identical for all three names, but still only "Jon" is considered a residual.

Is there any reason why QualityStage considered Jon a residual, rather than telling me that all three match and that I should survive only one of the three, instead of one of the two it is giving me currently?

Sorry for this basic question, but I cannot figure out why Jon would be treated as a residual when John and Johnathan are not.

Arvind
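
One way to read the posted numbers is to add up the per-field weights from the log. This is a sketch under two assumptions the thread does not confirm: that the composite weight is the plain sum of the field weights, and that any comparison involving JON was scored as a disagreement on first name. Under those assumptions, full agreement gives the reported 1.45, while a first-name disagreement gives a negative total that falls below a cutoff of 0.

```python
# Agree / disagree weights copied from the log posted earlier in the thread
FIELD_WEIGHTS = {
    "FirstName_USNAME": {"agree": 1.43, "disagree": -2.73},
    "PrimaryName_USNAME": {"agree": 0.01, "disagree": 0.00},
    "Zip": {"agree": 0.01, "disagree": 0.00},
}


def composite_weight(outcomes):
    """Sum per-field weights; outcomes maps field name -> 'agree' or 'disagree'."""
    total = sum(FIELD_WEIGHTS[field][result] for field, result in outcomes.items())
    return round(total, 2)


all_agree = {field: "agree" for field in FIELD_WEIGHTS}
first_name_disagrees = dict(all_agree, FirstName_USNAME="disagree")

print(composite_weight(all_agree))             # 1.45
print(composite_weight(first_name_disagrees))  # -2.71
```

The per-value weights in the table are identical because each first name occurs once; what differs between pairs is whether the comparison is scored as an agreement or a disagreement.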

vairus · Post by **vairus** » Mon Aug 25, 2008 5:53 am

Hi ,

If your business accepts that John, Jon and Johnathan are the same, then give all three first names an identical output value in the classification table.

For example:

john john F
jon john F
johnathan john F

Now standardize and match.

That way you can bring all three names into one match set.

Regards
Vairamuthu
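
The effect of this suggestion can be sketched outside QualityStage with a simple lookup. The dictionary below is a hypothetical stand-in for the three classification entries above, not the actual classification file format: once every variant maps to the same standard value, the three records become identical on the match fields.

```python
# Hypothetical lookup mirroring the three classification entries above
STANDARD_FIRST_NAME = {
    "JOHN": "JOHN",
    "JON": "JOHN",
    "JOHNATHAN": "JOHN",
}


def standardize_first_name(name):
    """Map a first-name variant to its standard value; unknown names pass through."""
    key = name.strip().upper()
    return STANDARD_FIRST_NAME.get(key, key)


records = [
    ("John", "Smith", "83713"),
    ("Johnathan", "Smith", "83713"),
    ("Jon", "Smith", "83713"),
]

# After standardization all three rows collapse to a single distinct key
standardized = {
    (standardize_first_name(first), last.upper(), zip_code)
    for first, last, zip_code in records
}
print(standardized)  # {('JOHN', 'SMITH', '83713')}
```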

DSXchange
