Creating Custom Rule Sets in QualityStage

JRodriguez · Post by **JRodriguez** » Fri Jun 19, 2009 9:14 am

Welcome!

There is no out-of-the-box rule set for matching company names. There is a rule set, USNAME, that lets you standardize company names and remove misspellings ...

Do you just want to standardize the data, or do you want to match it against a master?

There are good examples in the IBM Redbook "IBM WebSphere QualityStage Methodologies, Standardization, and Matching".

ray.wurlod · Post by **ray.wurlod** » Fri Jun 19, 2009 11:01 pm

There are two DVDs available from the DSXchange Learning Center: one on Pattern Action Language and the other on creating rule sets. But I agree with JRodriguez that most of what you appear to want to do can be accomplished out of the box.

Jyothee · Post by **Jyothee** » Tue Jun 23, 2009 8:48 am

Hi JRodriguez,

I need to do both standardization and matching against a set of master corporations. I am using the Reference Match stage to get matched records, but I am getting only exact matches. For example, if "Bank of America" is the corporation name in both the source and the master, that record is matched. If either side differs, for example "Bank of America Us", the record comes out as unmatched.

Can you please let me know what I need to specify in the Match specification to get approximate ("like") matches on words?

Thanks,

JRodriguez · Post by **JRodriguez** » Tue Jun 23, 2009 10:17 am

I will assume that both the input and the master reference file have been standardized.

To match "like" records, you define blocking criteria using attributes common to the master file and the input file, such as "MatchPrimaryWord1NYSIIS_USNAME" and the first character of "MatchPrimaryWord2NYSIIS_USNAME", or any combination of fields that you think will bring all "like" records into the same block. Select the character comparison ....
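
As a rough illustration of what blocking does (this is not QualityStage's implementation), the Python sketch below groups reference records by a block key and only pairs a data record with reference records in the same block. The `block_key` function is a hypothetical stand-in: it uses the upper-cased first word plus the first character of the second word, not a real NYSIIS code, and the sample names are made up.

```python
from collections import defaultdict

def block_key(name):
    """Hypothetical block key: first word plus first character of the second word."""
    words = name.upper().split()
    first = words[0] if words else ""
    second_initial = words[1][0] if len(words) > 1 else ""
    return (first, second_initial)

def candidate_pairs(data_records, reference_records):
    """Return (data, reference) pairs whose block keys are equal."""
    blocks = defaultdict(list)
    for ref in reference_records:
        blocks[block_key(ref)].append(ref)
    pairs = []
    for rec in data_records:
        # Only records in the same block are ever compared.
        for ref in blocks.get(block_key(rec), []):
            pairs.append((rec, ref))
    return pairs

data = ["Bank of America Us", "Wells Fargo"]
reference = ["Bank of America", "Bank One", "Wells Fargo Bank"]
pairs = candidate_pairs(data, reference)
```

Note that "Bank One" lands in the same block as "Bank of America Us" here. Blocking only narrows down the candidates; the match commands decide which candidates are real matches.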

In the match commands, use fields whose content could differ. A good idea is to use the fields generated in the Standardization process: PrimaryName_USNAME, MatchPrimaryNamePackKey_USNAME ... etc.

The best match types for multi-word comparison are MULTI_UNCERT, MULTI_ALIGN, and UNCERT, but you can use any other match type that fulfills your match requirements.
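
To give a feel for why an uncertainty-style, word-by-word comparison helps with names like "Bank of America Us", here is a minimal Python sketch. It is only an analogy: QualityStage uses its own string comparison algorithms, while this example assumes the standard library's `difflib.SequenceMatcher` as a stand-in. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def word_similarity(a, b):
    """Similarity of two words in the range 0.0 to 1.0, ignoring case."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

def multi_word_score(name_a, name_b, threshold=0.8):
    """Share of the longer name's word count that is covered by
    close-enough words from the shorter name."""
    words_a, words_b = name_a.split(), name_b.split()
    if not words_a or not words_b:
        return 0.0
    shorter, longer = sorted((words_a, words_b), key=len)
    matched = sum(
        1 for word in shorter
        if max(word_similarity(word, other) for other in longer) >= threshold
    )
    return matched / len(longer)

extra_word = multi_word_score("Bank of America", "Bank of America Us")
misspelled = multi_word_score("Bank of Amerika", "Bank of America")
unrelated = multi_word_score("Wells Fargo", "Bank of America")
```

An exact comparison would reject the first two pairs outright. A graded score lets them earn partial or full credit, which is the behavior the question is asking for.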

Set the Match, Clerical, and Duplicate cutoff values. I suggest that you set them all to zero the first time, run the Test Pass, and generate the histogram. Based on the histogram, find the composite weight values at which the records become matches, duplicates, and residuals. Then set the cutoff values and run the Test Pass again, repeating until you are comfortable with your match results.
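
The histogram-then-cutoff loop can be sketched in a few lines of Python. This is an illustration under assumptions, not what the Match Designer does internally: the composite weights are made up, the function names are hypothetical, and only the match and clerical cutoffs are modeled (the duplicate cutoff is left out for brevity).

```python
from collections import Counter

def weight_histogram(weights, bin_width=1.0):
    """Count candidate pairs per composite-weight bin (bin lower edge -> count)."""
    counts = Counter((w // bin_width) * bin_width for w in weights)
    return dict(sorted(counts.items()))

def classify(weight, match_cutoff, clerical_cutoff):
    """Label a pair by comparing its composite weight with the cutoffs."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

# Made-up composite weights from a first test pass.
weights = [-4.2, -3.8, -3.1, 1.5, 2.2, 7.9, 8.4, 8.8]
hist = weight_histogram(weights, bin_width=2.0)

# After inspecting the histogram, suppose the gaps suggest these cutoffs.
labels = [classify(w, match_cutoff=6.0, clerical_cutoff=1.0) for w in weights]
```

In this made-up data the weights fall into three clusters, so the cutoffs are placed in the gaps between them. Real data is rarely this clean, which is why the test pass is repeated.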

You will probably need more than one pass to get the final results. For the second and third passes, use different blocking criteria each time (you can use the same match commands) and repeat the previous step...
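
The idea of several passes with different blocking criteria can be sketched as follows. This is a simplified model, not QualityStage's behavior: the block keys (`first_word`, `last_word`) and the `shares_two_words` test are hypothetical stand-ins for real blocking columns and match commands, and the records are made up.

```python
def run_passes(data, reference, block_keys, is_match):
    """Run one pass per block key; unmatched (residual) records go to the next pass."""
    matches = {}
    residual = list(data)
    for key in block_keys:
        # Rebuild the blocks with this pass's blocking criteria.
        blocks = {}
        for ref in reference:
            blocks.setdefault(key(ref), []).append(ref)
        still_unmatched = []
        for rec in residual:
            candidates = blocks.get(key(rec), [])
            hit = next((ref for ref in candidates if is_match(rec, ref)), None)
            if hit is None:
                still_unmatched.append(rec)
            else:
                matches[rec] = hit
        residual = still_unmatched
    return matches, residual

def first_word(name):
    return name.upper().split()[0]

def last_word(name):
    return name.upper().split()[-1]

def shares_two_words(a, b):
    """Same match rule in every pass: at least two words in common."""
    return len(set(a.upper().split()) & set(b.upper().split())) >= 2

data = ["Bank of America Us", "The Wells Fargo", "Acme Corp"]
reference = ["Bank of America", "Wells Fargo"]
matches, residual = run_passes(data, reference, [first_word, last_word], shares_two_words)
```

"The Wells Fargo" is missed in the first pass because its first word differs from the reference record's, and it is picked up in the second pass, where the block key is the last word.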

Have fun!

Jyothee · Post by **Jyothee** » Thu Jun 25, 2009 6:15 am

Hi JRodriguez,

When I find matched records, the test results show a weight of 0. Where can I set weights for all the fields in both the data and reference columns? Which match type do I need to choose in the Match specification? As of now it is Reference. How can we set m_prob and u_prob for a particular match command? Is that just guesswork? Is it the same for the cutoffs?

Thanks,

JRodriguez · Post by **JRodriguez** » Fri Jun 26, 2009 8:47 am

When I find matched records, the test results show a weight of 0.
A: The match commands are not set properly ....

Where can I set weights for all the fields in both the data and reference columns?
A: QualityStage calculates the weights automatically once the proper values are filled in. The weights for all fields (the composite weight), for both the data and the reference file, are calculated based on the m and u probabilities, frequency information, and other factors.

Which match type do I need to choose in the Match specification?

A: In your case, for multi-word comparison, I would use MULTI_UNCERT, MULTI_ALIGN, or UNCERT.

How can we set m_prob and u_prob for a particular match command? Is that just guesswork? Is it the same for the cutoffs?

A: It is not guesswork.

The m probability comes from your requirements: the business rules should tell you what is considered a good match and the probabilities. The u probability is easier: fill in any value and the tool will take care of using the proper value ...

If they don't know, try m = 0.9 (90%).
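
To show how m and u probabilities turn into weights, here is a sketch of the standard Fellegi-Sunter formulas from probabilistic record linkage. Treat it as an illustration of the principle: the tool also uses frequency information, so its actual numbers will differ, and the u value of 0.01 in the usage lines is an assumed example, not a recommendation.

```python
import math

def agreement_weight(m, u):
    """Weight added when the field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added (a negative number) when the field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

def composite_weight(fields):
    """Sum the per-field weights.

    fields: iterable of (agrees, m, u) tuples, one per match command.
    """
    return sum(
        agreement_weight(m, u) if agrees else disagreement_weight(m, u)
        for agrees, m, u in fields
    )

# m = 0.9 as suggested above; u = 0.01 is an assumed example value.
agree = agreement_weight(0.9, 0.01)
disagree = disagreement_weight(0.9, 0.01)
total = composite_weight([(True, 0.9, 0.01), (False, 0.9, 0.01)])
```

With these values a field that agrees adds about 6.5 to the composite weight, and one that disagrees subtracts about 3.3. This also shows why a composite weight of exactly 0 for every pair usually points to a problem in the match commands rather than to the data.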

The cutoff values, on the other hand, should be set for each pass. The first time you run the Test Pass, it is a good idea to set them all to zero just to get the histogram for the pass. The histogram will show where the pairs become matches, duplicates, and residuals. Then pick the weights from the histogram, replace the cutoff values, and run the Test Pass again, repeating until all data records match properly with the reference records ...

The IBM QualityStage user guide explains this topic in detail.

Jyothee · Post by **Jyothee** » Mon Jun 29, 2009 9:00 am

Hi JRodriguez,

Thanks for the great support and for your patience in answering my questions. It is working for me now. I had just used the wrong blocking criteria, and I have now changed it.

Thanks,

ray.wurlod · Post by **ray.wurlod** » Mon Jun 29, 2009 4:45 pm

Please mark this thread as Resolved using the green button at top.