Setting Cutoff Values

Infosphere's Quality Product

Moderators: chulett, rschirm

mdbatra
Premium Member
Posts: 175
Joined: Wed Oct 22, 2008 10:01 am
Location: City of London

Setting Cutoff Values

Post by mdbatra »

Hi,

I have just started learning QualityStage through the PDF documentation, but I guess I am missing something.

How do we set the Match/Clerical cutoffs? I mean, how do we calculate them?

Also, on what basis do we set the m and u probabilities?

Would appreciate any help.

Thanks.
Rgds,
MB
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You set the cutoffs at the bottom of the match specification designer screen or, if you are testing the match specification, you can move the sliders on the histogram.

m-probability is driven by the error rate you're prepared to accept. The default (0.9) indicates that you're prepared to accept a 10% error rate - that is, m-probability is (1 - error rate).

u-probability measures the likelihood that the column agrees purely by random chance when the two records are not actually a match. The default (0.01) specifies that you believe that 1 in 100 such agreements will occur due to purely random chance.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mdbatra
Premium Member
Posts: 175
Joined: Wed Oct 22, 2008 10:01 am
Location: City of London

Post by mdbatra »

Thanks for the reply.

I observed that the cutoff values are specified at the bottom of the Match Designer window.

I think it would help if I described the intent. I need to perform a Reference Match (many-to-one) with the following (sample) data:

Source Data
Col1, Col2
A,International Bus. Machine
A,International B M
A,IBM
A,Inter. Bus. Machine

Reference Data
A,International Business Machine

In the specification, I defined "Col1" as the blocking column and the abbreviation for "Col2" as the matching column (using the standardization process, it evaluates to "IBM" here). The comparison type is set to "CHAR". The Match/Clerical cutoffs are currently set to 0.

By default, the m and u probabilities are set to 0.9 and 0.01 respectively. When I did a test run (just one pass, as defined above), the weight for the matching records was "0.58".

The things I am not able to figure out are:
1. How is the Match/Clerical cutoff derived on the basis of the weight? Should it always be set to the same value as the weight?
2. I tried changing the comparison type to "MULT_UNCERT" and it asked for a "Param1" value. How do we determine this?
Rgds,
MB
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

This is a long and complex discussion. Weights (agreement or disagreement) are assigned to each field based on their "information content", or rarity within their own domain. Both m-prob and u-prob are used in this calculation. Logarithms are used so that column weights can validly be added (rather than multiplying probabilities).
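To make the logarithm point concrete, here is a minimal sketch assuming the usual Fellegi-Sunter style formulas: agreement weight = log2(m/u) and disagreement weight = log2((1-m)/(1-u)). The function names are made up for illustration, and the product's real calculation also depends on the comparison type and value frequencies, so do not expect it to reproduce the numbers shown in Match Designer.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Weight added when a column agrees (assumed formula: log2(m/u))."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Weight added when a column disagrees (assumed: log2((1-m)/(1-u)))."""
    return math.log2((1 - m) / (1 - u))

def composite_weight(columns):
    """Sum per-column weights.

    `columns` is a list of (m, u, agrees) tuples, one per match column.
    Adding logs is equivalent to multiplying the underlying probability ratios.
    """
    total = 0.0
    for m, u, agrees in columns:
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total

# Default probabilities mentioned in this thread
agree = agreement_weight(0.9, 0.01)        # roughly +6.49
disagree = disagreement_weight(0.9, 0.01)  # roughly -3.31

# Two columns with default probabilities: one agrees, one disagrees
total = composite_weight([(0.9, 0.01, True), (0.9, 0.01, False)])
```

Note how a rarer chance agreement (smaller u) pushes the agreement weight up, which is the "information content" idea.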

The cutoffs are set based on your data. Any record with an aggregate weight above the match cutoff will be considered a match (it will be the "master" record if it's the highest weight in its "block"). Any record with an aggregate weight below the clerical cutoff will be considered a non-match for the pass. Any record with an aggregate weight between the two values will be considered for "clerical review" - inspection by a subject matter expert.
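The three outcomes can be sketched as a simple rule. This is only an illustration: the function name and labels are hypothetical, and treating a weight exactly equal to a cutoff as falling in the higher category is my assumption, not something stated above.

```python
def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Classify an aggregate weight against the two cutoffs.

    Assumption: a weight exactly on a cutoff goes to the higher category.
    """
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical cutoff must not exceed match cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

# Example with made-up cutoffs: match at 5.0, clerical at 2.0
results = [classify(w, 5.0, 2.0) for w in (6.5, 3.0, -3.3)]
```

With both cutoffs set to the same value, the clerical band disappears and every pair is either a match or a non-match.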

The initial weights for any individual record are determined by comparing it to itself. Blocks - determined by blocking fields - determine which records will be compared with each other. This is why it's important not to have too many records per block - 100 records means 10000 comparisons, 200 records means 40000 comparisons, and so on.
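To get a feel for block sizes before running a pass, something like the following rough sketch can help. It uses the n x n estimate from the figures above; the function name is hypothetical, and the real number of comparisons depends on the match type (for a reference match it is closer to source rows times reference rows per block).

```python
from collections import Counter

def comparisons_per_block(blocking_keys):
    """Estimate comparisons per block as n * n, where n is the block size.

    `blocking_keys` is an iterable with one blocking-column value per record.
    """
    sizes = Counter(blocking_keys)
    return {key: n * n for key, n in sizes.items()}

# 100 records in block "A", 200 records in block "B"
keys = ["A"] * 100 + ["B"] * 200
estimate = comparisons_per_block(keys)
largest = max(estimate, key=estimate.get)
```

Doubling a block quadruples the work, which is why a more selective blocking column pays off quickly.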
mdbatra
Premium Member
Posts: 175
Joined: Wed Oct 22, 2008 10:01 am
Location: City of London

Post by mdbatra »

Are these match specifications stored somewhere in the project directory, or in any other place in the DS repository?
Rgds,
MB
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Match specifications and pass definitions are stored in the repository. When you're using Designer you can see the former (they have MAT on their icons) and the latter (they have PAS on their icons).
mdbatra
Premium Member
Posts: 175
Joined: Wed Oct 22, 2008 10:01 am
Location: City of London

Post by mdbatra »

So stupid of me...
Thanks :D
Rgds,
MB