Matching Question

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
dalecooper
Participant
Posts: 20
Joined: Tue May 13, 2008 1:24 am
Location: Dale Cooper

Re: Matching Question

Post by dalecooper »

Hi Julio

Thanks for the reply!
Unfortunately it does not work on my side.
The closest I get to the desired results is not to have
AddedColumn in the blocking but to include it in
the match specification, UNCERT 900, 0.9 and 0.1.
Critical Missing OK

For this example I have standardized the CompanyName.

So The columns look similar to this:

MatchName1|MatchName2|DivisionNumber|Addedcolumn|
IBM | | 01 | 01 |
IBM | Corp | 01 | 01 |
IBM | CC | 01 | |
IBM | Pty | 01 | 07 |
IBM | | 01 | 07 |

Blocking on MatchName1 and DivisionNumber.
And I agree that blocking on MatchName1 and DivisionNumber and
matching on AddedColumn UNCERT 0.9 and 0.1 , 900 and Critical Missing OK
should group it correctly but it does not.
The above code is used in the seventh pass.

I get results like this:
Pass6:
MatchName1|MatchName2|DivisionNumber|Addedcolumn|
IBM | | 01 | 01 |
IBM | CC | 01 | |

Pass7:
MatchName1|MatchName2|DivisionNumber|Addedcolumn|
IBM | Corp | 01 | 01 |
IBM | Pty | 01 | 07 |
IBM | | 01 | 07 |

In Pass7 there is 01 also included!

Any suggestions?

Thanks
Dale
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Re: Matching Question

Post by JRodriguez »

Dale,

Which column in your input data is having a null value? I can see it .......
maybe if you use "Code" it will be properly tabulated ...

If you are using just AddedColumn as your Blocking column and one of the records is having a null value in this column then that record will be automatically rejected but it will available for other passes ...

If the AddedColumn is having a null value, I guess that you will need two passes, please use independent match:

1) Block on Added column
Nothing on Math commands!

Results should be like:

MatchName1|MatchName2|DivisionNumber|Addedcolumn|
IBM | | 01 | 01 |
IBM | Corp | 01 | 01 |


IBM | Pty | 01 | 07 |
IBM | | 01 | 07 |

Residual record ( will be available for second pass)
IBM | CC | 01 | |


2) Block on DivisionNumber and First Character of MatchName2
Nothing on Match commands!

This will add the residual record from pass1 to top group of records

MatchName1|MatchName2|DivisionNumber|Addedcolumn|

IBM | | 01 | 01 |
IBM | Corp | 01 | 01 |
IBM | CC | 01 | |



IBM | Pty | 01 | 07 |
IBM | | 01 | 07 |



dalecooper wrote:Hi Julio

Thanks for the reply!
Unfortunately it does not work on my side.
The closest I get to the desired results is not to have
AddedColumn in the blocking but to include it in
the match specification, UNCERT 900, 0.9 and 0.1.
Critical Missing OK

For this example I have standardized the CompanyName.

So The columns look similar to this:

MatchName1|MatchName2|DivisionNumber|Addedcolumn|
IBM | | 01 | 01 |
IBM | Corp | 01 | 01 |
IBM | CC | 01 | |
IBM | Pty | 01 | 07 |
IBM | | 01 | 07 |

Blocking on MatchName1 and DivisionNumber.
And I agree that blocking on MatchName1 and DivisionNumber and
matching on AddedColumn UNCERT 0.9 and 0.1 , 900 and Critical Missing OK
should group it correctly but it does not.
The above code is used in the seventh pass.

I get results like this:
Pass6:
MatchName1|MatchName2|DivisionNumber|Addedcolumn|
IBM | | 01 | 01 |
IBM | CC | 01 | |

Pass7:
MatchName1|MatchName2|DivisionNumber|Addedcolumn|
IBM | Corp | 01 | 01 |
IBM | Pty | 01 | 07 |
IBM | | 01 | 07 |

In Pass7 there is 01 also included!

Any suggestions?

Thanks
Dale
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
dalecooper
Participant
Posts: 20
Joined: Tue May 13, 2008 1:24 am
Location: Dale Cooper

Post by dalecooper »

Hi Julio

Thank you again for the help!

I have tried this and it works, however don't you think it
will also be possible to modify the "Weight Overrides" and add a value
to the "AddedColum" to ensure that if the data agrees/ do not agree then
it should put it in different blocks?

The "Weight Overrides" screen has the following options:

Replace/Add Radio buttons.
Agreement Weight[AW]
Disagreement Weight[DW]
Data Source Missing Weight[AM]
Reference Source Missing Weight[BM]
Both Missing Weight[XM]
Conditional Source Value[AV]
Conditional Reference Source Value[BV]

So, if I select "Add" and put -15 in the "Disagreement Weight" this might block the data into different blocks?

Thanks
Dale
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Please be careful with terminology.

Blocking columns MUST acheive an exact match.

Data are grouped into "sets", each with a unique Set ID, to indicate potential duplicates based on your other criteria.

Increasing the disagreement weight is more likely to reduce the number of potential duplicates.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dalecooper
Participant
Posts: 20
Joined: Tue May 13, 2008 1:24 am
Location: Dale Cooper

Post by dalecooper »

Thanks Ray for the information.

Where can I find what each of the following means and how to set it up?

Replace/Add Radio buttons.
Agreement Weight[AW]
Disagreement Weight[DW]
Data Source Missing Weight[AM]
Reference Source Missing Weight[BM]
Both Missing Weight[XM]
Conditional Source Value[AV]
Conditional Reference Source Value[BV]

I have tried numerous settings for the above but cannot get the correct results?
Is there maybe a more detailed description somewhere on what the above does?
Basically what I need is that when a certain column does not match it needs to be allocted to different buckets?

Example :
(This is the third pass)
AddedColumnToMatch:
Same Pass
Matched set, SetId 1 :
07
07
null

Matched set, SerId2 :
03
03

Thanks
Dale
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

Dale,

You can find the meaning of those weight override variables in the WebSphere QualityStage User Guide pdf that comes with the product and the online help of the product itself. Learning how to set them up after getting the sense of how the tool used composite weights and cut off values to determine match, duplicates, and residual pairs, is just a matter of practicing different predetermined scenarios like you are doing


dalecooper wrote:Thanks Ray for the information.

Where can I find what each of the following means and how to set it up?

Replace/Add Radio buttons.
Agreement Weight[AW]
Disagreement Weight[DW]
Data Source Missing Weight[AM]
Reference Source Missing Weight[BM]
Both Missing Weight[XM]
Conditional Source Value[AV]
Conditional Reference Source Value[BV]

I have tried numerous settings for the above but cannot get the correct results?
Is there maybe a more detailed description somewhere on what the above does?
Basically what I need is that when a certain column does not match it needs to be allocted to different buckets?

Example :
(This is the third pass)
AddedColumnToMatch:
Same Pass
Matched set, SetId 1 :
07
07
null

Matched set, SerId2 :
03
03

Thanks
Dale
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

Dale,

Yes is possible ... but adding,replacing and scaling weights only modify the calculated composite weight of the pair, the key factor if you take that route, will be the cut off value of the pass that will determine if the pair is a match or non match

If you want a record to be considered automatically a non match base on a column disagreement weight ( In your case "AddedColumn") then you should declare the column as CRITICAL [MISSINGOK] in the Special Variable Properties ( VARTYPE )
dalecooper wrote:Hi Julio

Thank you again for the help!

I have tried this and it works, however don't you think it
will also be possible to modify the "Weight Overrides" and add a value
to the "AddedColum" to ensure that if the data agrees/ do not agree then
it should put it in different blocks?

The "Weight Overrides" screen has the following options:

Replace/Add Radio buttons.
Agreement Weight[AW]
Disagreement Weight[DW]
Data Source Missing Weight[AM]
Reference Source Missing Weight[BM]
Both Missing Weight[XM]
Conditional Source Value[AV]
Conditional Reference Source Value[BV]

So, if I select "Add" and put -15 in the "Disagreement Weight" this might block the data into different blocks?

Thanks
Dale
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Julio - not to be a 'noodge' but you are aware there's no need to quote the entire contents of every post you reply to, yes? There is a perfectly lovely Reply to topic button that starts you off with a clean slate.
-craig

"You can never have too many knives" -- Logan Nine Fingers
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

I did not notice this buttom until you point me to it, thanks!
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

:(
I did not notice this buttom until you point me to it, thanks!
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
dalecooper
Participant
Posts: 20
Joined: Tue May 13, 2008 1:24 am
Location: Dale Cooper

Post by dalecooper »

Well, thanks for all the information.
I'm sure from here I will able to manage the changes on my side!

Your help Julio/Ray is much appreciated.
Post Reply