Creating Custom Rule Sets in QualityStage

Infosphere's Quality Product

Moderators: chulett, rschirm

JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City

Post by JRodriguez »

Welcome!

There is no out-of-the-box rule set for matching company names. There is a rule set, USNAME, that lets you standardize company names, removing all misspellings ...

Do you want just to standardize the data, or do you want to match it against a master?

There are good examples in the IBM Redbook "IBM WebSphere QualityStage Methodologies, Standardization, and Matching".
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Two DVDs are available from the DSXchange Learning Center, one on Pattern Action Language and the other on creating rule sets. But I agree with JRodriguez that most of what you appear to want to do can be accomplished out of the box.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Jyothee
Participant
Posts: 8
Joined: Thu Jun 18, 2009 11:58 pm

Post by Jyothee »

Hi JRodriguez,

I need to do both: standardize the data and match it against a set of master corporations. I am using the Reference Match stage to get matched records, but I am getting only exact matches. For example, if "Bank of America" is the corporation name in both the source and the master, that record matches. If either one is modified, to "Bank of America Us" or something similar, it comes out as an unmatched record.

Can you please tell me what I need to specify in the Match specification to get "like" matches on words?

Thanks,
jyothi

Post by JRodriguez »

I will assume that both the input and the master reference file have been standardized ..

To match "like" records, you basically define blocking criteria using attributes common to the master file and the input file, such as "MatchPrimaryWord1NYSIIS_USNAME" and the first character of "MatchPrimaryWord2NYSIIS_USNAME", or any combination of fields that you think should bring all "like" records into the same block. Select the character comparison ....
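To illustrate what blocking does, here is a minimal Python sketch, not QualityStage code. The `blocking_key` function is a hypothetical stand-in: it uses the raw first word plus the first character of the second word, where the real job would use the NYSIIS fields produced by standardization. The sample names are made up.

```python
from collections import defaultdict


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first word plus first character of the second word.

    A real job would use the NYSIIS codes generated by standardization instead.
    """
    words = name.upper().split()
    first = words[0] if words else ""
    second_initial = words[1][0] if len(words) > 1 else ""
    return f"{first}|{second_initial}"


def build_blocks(records):
    """Group record names by their blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks


def candidate_pairs(source, master):
    """Pair each source record only with master records in the same block."""
    master_blocks = build_blocks(master)
    pairs = []
    for rec in source:
        for ref in master_blocks.get(blocking_key(rec), []):
            pairs.append((rec, ref))
    return pairs


# Made-up sample data.
source = ["Bank of America Us", "Wells Fargo Bank"]
master = ["Bank of America", "Bank of Montreal", "Wells Fargo"]
pairs = candidate_pairs(source, master)
for pair in pairs:
    print(pair)
```

Only pairs that share a block are ever compared, so a blocking key that is too strict hides true matches, and one that is too loose produces very large blocks.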

In the match commands, use fields whose content could differ. A good idea is to use the fields generated by the Standardization process: PrimaryName_USNAME, MatchPrimaryNamePackKey_USNAME ... etc

The best match types for multi-word comparison are MULTI_UNCERT, MULTI_ALIGN, and UNCERT, but you can use any other match type that fulfills your match requirements.
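As a rough illustration of why an uncertainty-style comparison catches pairs that an exact comparison misses, here is a small Python sketch. It is not the MULTI_UNCERT or UNCERT algorithm; it swaps in `difflib.SequenceMatcher` from the Python standard library as a crude stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def multiword_similarity(a: str, b: str) -> float:
    """Average, over the words of the shorter name, of the best
    similarity to any word of the longer name."""
    words_a, words_b = a.split(), b.split()
    if not words_a or not words_b:
        return 0.0
    shorter, longer = sorted([words_a, words_b], key=len)
    best = [max(char_similarity(w, x) for x in longer) for w in shorter]
    return sum(best) / len(best)


# Made-up sample pairs: an extra word, a misspelling, and an unrelated name.
for left, right in [
    ("Bank of America", "Bank of America Us"),
    ("Bank of America", "Bank of Amercia"),
    ("Bank of America", "Wells Fargo"),
]:
    print(left == right, round(multiword_similarity(left, right), 2), left, "/", right)
```

None of the three pairs is an exact match, but the first two score far higher than the third, which is the kind of graded result a weight-based match can work with.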

Set the Match, Clerical, and Duplicate cutoff values. I suggest setting them all to zero the first time. Run the Test Pass, generate the histogram, and from the graph find the composite weight values at which records become matches, duplicates, and residuals. Then set the cutoff values and run the Test Pass again, repeating until you are comfortable with your match results.
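The histogram step can be pictured with a short Python sketch. This is only an illustration of reading cutoffs from a weight distribution; the weights are made up, and `weight_histogram` is a hypothetical helper, not something the tool exposes.

```python
from collections import Counter


def weight_histogram(weights, bin_width=1.0):
    """Count composite weights per bin; each bin is labelled by its lower edge."""
    counts = Counter()
    for w in weights:
        lower_edge = (w // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


def print_histogram(hist):
    """Print one text bar per bin."""
    for lower_edge, count in hist.items():
        print(f"{lower_edge:6.1f} | {'#' * count}")


# Made-up composite weights from a hypothetical test pass.
weights = [-4.2, -3.8, -3.1, 0.4, 1.2, 7.9, 8.3, 8.8, 9.1]
hist = weight_histogram(weights, bin_width=2.0)
print_histogram(hist)
```

In data like this, the gap between the low cluster and the high cluster is where you would look for cutoff values.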


You will probably need more than one pass to get final results. For the second and third passes, use different blocking criteria each time; you can use the same match commands and repeat the previous step ...
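The multi-pass idea can be sketched in Python as follows. This assumes that records left unmatched by one pass are fed to the next pass with a different blocking key. The key functions and the `shares_two_words` comparison are hypothetical and far simpler than real match commands.

```python
def first_word(name: str) -> str:
    """Hypothetical blocking key for pass 1."""
    return name.upper().split()[0]


def last_word(name: str) -> str:
    """Hypothetical blocking key for pass 2."""
    return name.upper().split()[-1]


def shares_two_words(a: str, b: str) -> bool:
    """Toy comparison: the names have at least two words in common."""
    return len(set(a.upper().split()) & set(b.upper().split())) >= 2


def run_passes(source, master, key_funcs, is_match):
    """Run one pass per blocking key; unmatched records go on to the next pass."""
    matched = {}
    residual = list(source)
    for key in key_funcs:
        still_unmatched = []
        for rec in residual:
            candidates = [ref for ref in master if key(ref) == key(rec)]
            hit = next((ref for ref in candidates if is_match(rec, ref)), None)
            if hit is None:
                still_unmatched.append(rec)
            else:
                matched[rec] = hit
        residual = still_unmatched
    return matched, residual


# Made-up sample data.
source = ["Bank of America Us", "The Wells Fargo"]
master = ["Bank of America", "Wells Fargo"]
matched, residual = run_passes(source, master, [first_word, last_word], shares_two_words)
print(matched)
print(residual)
```

Here "The Wells Fargo" falls into no block in the first pass because its first word differs from the master's, and is picked up in the second pass, which blocks on the last word instead.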

Have fun!

Post by Jyothee »

Hi JRodriguez,

When I find matched records, the test results show a weight of 0. Where can I set weights for all the fields in both the data and reference columns? Which match type do I need to choose in the Match specification? As of now it is Reference. How can we set m_prob and u_prob for a particular match command? Is that just guesswork? Is it the same with the cutoffs?

Thanks,

jyothi

Post by JRodriguez »

When I find matched records, the test results show a weight of 0.
A: The match commands are not set up properly ....

Where can I set weights for all the fields in both the data and reference columns?
A: The QS tool calculates the weights automatically once you fill in the proper values. The weights for all fields (the composite weight), for both the data and reference files, are calculated based on the m and u probabilities, frequency information, and other factors.
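For intuition about how m and u probabilities turn into weights, here is a Python sketch of the standard probabilistic record linkage formulas: agreement weight log2(m/u) and disagreement weight log2((1-m)/(1-u)), summed over fields. Treat it as an approximation of what the tool does internally; the field names and the u values are made up, and the frequency adjustments mentioned above are left out.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight added when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added (a negative number when m > u) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def composite_weight(field_probs, agreements) -> float:
    """Sum the per-field weights for one record pair."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if agreements[field]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


# Hypothetical fields with assumed (m, u) probabilities.
field_probs = {"name": (0.9, 0.01), "city": (0.9, 0.1)}

both_agree = composite_weight(field_probs, {"name": True, "city": True})
name_only = composite_weight(field_probs, {"name": True, "city": False})
neither = composite_weight(field_probs, {"name": False, "city": False})
print(round(both_agree, 2), round(name_only, 2), round(neither, 2))
```

A field with a small u (agreement by chance is rare) contributes a large agreement weight, which is why the name field dominates in this example.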

Which match type do I need to choose in the Match specification?

A: In your case, for multi-word comparison, I would use MULTI_UNCERT, MULTI_ALIGN, or UNCERT.

How can we set m_prob and u_prob for a particular match command? Is that just guesswork? Is it the same with the cutoffs?

A: It is not guesswork.

The m probability comes from your requirements: the business rules should tell you what is considered a good match, and the probabilities. The u probability is easier: fill in any value and the tool will take care of using the proper value ...

If they don't know, try m = .9 (90%).

The cutoff values, on the other hand, should be set for each pass. The first time you run the Test Pass, it is a good idea to set them all to zero just to get the histogram of the pass. The histogram will show where the pairs become matches, duplicates, and residuals. Then you pick the weights from the graph, replace the cutoff values, and run the Test Pass again, repeating until all data records match properly with the reference records ...
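Once cutoffs are chosen, each pair is classified by comparing its composite weight against them. The Python sketch below is a simplified illustration with made-up weights and cutoffs. It covers only the match and clerical cutoffs and leaves out the duplicate cutoff.

```python
def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Classify one pair by its composite weight.

    Assumes clerical_cutoff <= match_cutoff.
    """
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


# Made-up composite weights and cutoffs picked from a hypothetical histogram.
weights = [9.1, 8.3, 5.0, 1.2, -3.8]
results = [classify(w, match_cutoff=7.0, clerical_cutoff=4.0) for w in weights]
print(results)

# With all cutoffs at zero, as suggested for the first test pass,
# every pair with a non-negative weight is reported as a match.
first_pass = [classify(w, match_cutoff=0.0, clerical_cutoff=0.0) for w in weights]
print(first_pass)
```

Raising the match cutoff trades missed matches for fewer false matches; the clerical band between the two cutoffs holds the pairs that need manual review.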

The IBM QualityStage user guide explains this topic in detail.

Post by Jyothee »

Hi JRodriguez,

Thanks for the great support and for your patience in answering my questions. It is working for me now. I had just used the wrong blocking criteria, and I have now changed it.

Thanks,
jyothi

Post by ray.wurlod »

Please mark this thread as Resolved using the green button at top.