Setting Cutoff Values

mdbatra · Post by **mdbatra** » Mon Apr 11, 2011 3:43 am

Hi,

I have just started learning QualityStage through the PDF documentation, but I guess I am missing something.

How do we set the Match/Clerical cutoffs? I mean, how do we calculate them?

Also, on what basis do we set the m and u probabilities?

Would appreciate any help.

Thanks.

ray.wurlod · Post by **ray.wurlod** » Mon Apr 11, 2011 4:13 pm

You set the cutoffs at the bottom of the match specification designer screen or, if you are testing the match specification, you can move the sliders on the histogram.

m-probability is driven by the error rate you're prepared to accept. The default (0.9) indicates that you're prepared to accept a 10% error rate - that is, m-probability is (1 - error rate).

u-probability measures the likelihood that, when a match is found, it is found as a result of random chance. The default (0.01) specifies that you believe that 1 in 100 matches will occur due to purely random chance.

mdbatra · Post by **mdbatra** » Tue Apr 12, 2011 6:06 am

Thanks for the reply.

I observed that the cutoff values are specified at the bottom of the Match Designer window.

I think it would help if I explained the intent. I need to perform a Reference Match (Many-to-one) with the following (sample) data:

Source Data
Col1, Col2
A,International Bus. Machine
A,International B M
A,IBM
A,Inter. Bus. Machine

Reference Data
A,International Business Machine

In the specification, I defined "Col1" as the blocking column and the abbreviation of "Col2" as the matching column (derived by the standardization process, which evaluates to "IBM" here). The comparison type is set to "CHAR". The Match/Clerical cutoffs are currently both set to 0.

By default, the m and u probabilities are set to 0.9 and 0.01 respectively. When I did a test run (just the one pass defined above), the weight for the matching records was 0.58.

The things I am not able to figure out are:
1. How are the Match/Clerical cutoffs derived from the weight? Should they always be set to the same value as the weight?
2. I tried changing the comparison type to "MULT_UNCERT" and it asked for a "Param1" value. How do we determine this value?

ray.wurlod · Post by **ray.wurlod** » Tue Apr 12, 2011 4:24 pm

This is a long and complex discussion. Weights (agreement or disagreement) are assigned to each field based on its "information content", or rarity within its own domain. Both m-prob and u-prob are used in this calculation. Logarithms are used so that column weights can validly be added (rather than multiplying probabilities).
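As a rough illustration of how m-prob and u-prob turn into additive weights, here is a minimal sketch of the standard Fellegi-Sunter log-odds formulas. The function names are made up for this example, and it assumes the plain base-2 formulas only. A real match run may adjust the weights further (for example by comparison type or value frequency), so these numbers need not equal what Match Designer reports.

```python
import math

def agreement_weight(m, u):
    """Weight added when a column agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (a negative number) when a column disagrees."""
    return math.log2((1 - m) / (1 - u))

def composite_weight(columns, agreements):
    """Sum the per-column weights for one record pair.

    columns    -- list of (m, u) tuples, one per match column
    agreements -- list of booleans, True where the column agrees
    """
    total = 0.0
    for (m, u), agrees in zip(columns, agreements):
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total

# Default probabilities mentioned in this thread: m = 0.9, u = 0.01
print(round(agreement_weight(0.9, 0.01), 2))     # 6.49
print(round(disagreement_weight(0.9, 0.01), 2))  # -3.31
```

Adding the logarithms is equivalent to multiplying the underlying probability ratios, which is why the column weights can simply be summed into one aggregate weight per record pair.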

The cutoffs are set based on your data. Any record with an aggregate weight above the match cutoff will be considered a match (it will be the "master" record if it's the highest weight in its "block"). Any record with an aggregate weight below the clerical cutoff will be considered a non-match for the pass. Any record with an aggregate weight between the two values will be considered for "clerical review" - inspection by a subject matter expert.
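The three outcomes described above can be sketched as a small function. This is only an illustration: `classify` is a hypothetical name, and treating a weight exactly equal to a cutoff as belonging to the higher category is an assumption made for the sketch.

```python
def classify(weight, match_cutoff, clerical_cutoff):
    """Place one record pair's aggregate weight into a category.

    Assumption: a weight exactly on a cutoff falls into the higher category.
    """
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical cutoff must not exceed match cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

# Example with hypothetical cutoffs of 5.0 (match) and 2.0 (clerical)
for w in (6.49, 3.18, -3.31):
    print(w, classify(w, match_cutoff=5.0, clerical_cutoff=2.0))
```

Setting both cutoffs to the same value removes the clerical band entirely, so every pair is either a match or a non-match.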

The initial weights for any individual record are determined by comparing it to itself. Blocks - determined by blocking fields - determine which records will be compared with each other. This is why it's important not to have too many records per block - 100 records means 10000 comparisons, 200 records means 40000 comparisons, and so on.
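To see why block size matters, the sketch below counts comparisons per block using the same rough n x n estimate as above. `comparisons_per_block` is a hypothetical helper, and the n x n figure is an approximation: the exact count depends on the match type.

```python
from collections import Counter

def comparisons_per_block(block_keys):
    """Estimate comparisons per block as n * n, where n is the block size.

    block_keys -- one blocking-key value per record
    """
    sizes = Counter(block_keys)
    return {key: n * n for key, n in sizes.items()}

# 200 records in a single block versus the same records split into two blocks
one_block = ["A"] * 200
two_blocks = ["A"] * 100 + ["B"] * 100

print(sum(comparisons_per_block(one_block).values()))   # 40000
print(sum(comparisons_per_block(two_blocks).values()))  # 20000
```

Splitting the same records across more blocks cuts the total work, at the risk of separating records that should have been compared.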

mdbatra · Post by **mdbatra** » Thu Apr 14, 2011 8:51 am

Are these match specifications stored somewhere in the project directory, or somewhere else in the DS repository?

ray.wurlod · Post by **ray.wurlod** » Thu Apr 14, 2011 4:10 pm

Match specifications and pass definitions are stored in the repository. When you're using Designer you can see the former (they have MAT on their icons) and the latter (they have PAS on their icons).

mdbatra · Post by **mdbatra** » Fri Apr 15, 2011 12:16 am

So stupid of me...
Thanks :D