Welcome!
There is no out-of-the-box rule set for matching company names, but the USNAME rule set lets you standardize company names and remove misspellings ...
Do you just want to standardize the data, or do you also want to match it against a master?
There are good examples in the IBM Redbook "IBM WebSphere QualityStage Methodologies, Standardization, and Matching".
Creating Custom Rule Sets in QualityStage
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
There are two DVDs that are available from DSXchange Learning Center - one on Pattern Action Language and the other on creating rule sets. But I agree with JRodriguez that most of what you appear to want to do can be accomplished out of the box.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hi JRodriguez,
I need to do both: standardize the data and match it against a set of master corporations. I am using the Reference Match stage to get matched records, but I am getting only exact matches. For example, if the corporation name is "Bank of America" in both the source and the master, that record matches. If either side differs, such as "Bank of America Us", the record comes out as unmatched.
Can you please let me know what I need to specify in the Match specification to get "like" matches on words?
Thanks,
jyothi
I will assume that both the input and the Master reference files have been standardized ..
To match "like" records, you basically define blocking criteria using attributes common to the Master file and the input file, such as "MatchPrimaryWord1NYSIIS_USNAME" and the first character of "MatchPrimaryWord2NYSIIS_USNAME", or any combination of fields that you think should bring all "like" records into the same block. Select the character comparison ....
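To make the blocking idea concrete, here is a minimal Python sketch of what blocking does conceptually. It is not QualityStage code: `crude_key` is a hypothetical stand-in for a phonetic code such as NYSIIS (it is not the real algorithm), and the record values are made up. Only pairs that share a block key become candidates for the match commands.

```python
from collections import defaultdict

def crude_key(word: str) -> str:
    """Hypothetical stand-in for a phonetic code such as NYSIIS
    (NOT the real algorithm): keep the first letter, drop later vowels."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    return word[0] + "".join(ch for ch in word[1:] if ch not in "AEIOU")

def block_key(name: str) -> tuple:
    """Block on the key of word 1 plus the first character of word 2's key."""
    words = name.split()
    first = crude_key(words[0]) if words else ""
    second = crude_key(words[1])[:1] if len(words) > 1 else ""
    return (first, second)

def candidate_pairs(source, master):
    """Return (source, master) pairs that fall into the same block."""
    blocks = defaultdict(list)
    for m in master:
        blocks[block_key(m)].append(m)
    return [(s, m) for s in source for m in blocks.get(block_key(s), [])]

source = ["Bank of America Us", "Bnak of Amrica", "Wells Fargo"]
master = ["Bank of America", "Wells Fargo Bank"]
for pair in candidate_pairs(source, master):
    print(pair)
```

If the blocking key is too strict (for example, the full company name), "like" records never land in the same block and are never compared at all.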
In the match commands, use fields whose content could differ. A good idea is to use the fields generated by the Standardization process: PrimaryName_USNAME, MatchPrimaryNamePackKey_USNAME ... etc
The best match types for multi-word comparison are MULTI_UNCERT, MULTI_ALIGN, and UNCERT, but you can use any other match type that fulfills your match requirements.
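As a rough illustration of why an uncertainty-style comparison helps where an exact comparison fails, the sketch below scores string similarity with Python's standard `difflib`. This is only an analogy: it is not the algorithm behind UNCERT or MULTI_UNCERT, and the `threshold` value is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

def roughly_agrees(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two values as agreeing when they are similar enough.
    The threshold is an illustrative assumption, not a product default."""
    return similarity(a, b) >= threshold

print(similarity("Bank of America", "Bank of America Us"))
print(roughly_agrees("Bank of America", "Bank of America Us"))
print(roughly_agrees("Bank of America", "Wells Fargo"))
```

An exact comparison would reject "Bank of America Us" outright, while a similarity score lets a near match still contribute to the composite weight.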
Set the Match, Clerical and Duplicate cutoff values. I suggest that you set those values to zero the first time, run the Test Pass, generate the histogram, and, based on the graph, find out at which composite weight values the records become matches, duplicates and residuals .... Then set the cutoff values and run the Test Pass again until you are comfortable with your match results.
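The cutoff tuning loop can be pictured with a small Python sketch. It assumes the usual convention that a pair at or above the match cutoff is a match, a pair between the clerical and match cutoffs goes to clerical review, and anything lower is a nonmatch. The weights, cutoffs and bin width below are made-up illustration values, not output from a real Test Pass.

```python
from collections import Counter

def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Classify a pair by its composite weight."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

def histogram(weights, bin_width: float = 2.0) -> dict:
    """Count weights per bin, keyed by each bin's lower edge."""
    counts = Counter(int(w // bin_width) * bin_width for w in weights)
    return dict(sorted(counts.items()))

# Hypothetical composite weights from a first pass run with all cutoffs at zero.
weights = [-3.5, -1.0, 0.5, 7.2, 8.9, 9.1]
for edge, count in histogram(weights).items():
    print(f"{edge:6.1f} | {'#' * count}")

# After reading the histogram, pick cutoffs where the weights separate and rerun.
for w in weights:
    print(w, classify(w, match_cutoff=7.0, clerical_cutoff=3.0))
```

A gap in the histogram (here between the low and high weights) is the natural place to put the cutoffs.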
You will probably need more than one pass to get final results ... For the second and third passes, use different blocking criteria each time; you can use the same match commands and repeat the previous step ...
Have fun!
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses
Hi JRodriguez,
When I find matched records, the test results show a weight of 0. Where can I set weights for all the fields in both the data and reference columns? Which match type do I need to use in the Match specification? As of now it is Reference. How can we set m_prob and u_prob for a particular match command? Is that just guesswork? Is it the same with the cutoffs?
Thanks,
jyothi
When I find matched records, the test results show a weight of 0.
A: The match commands are not set properly ....
Where can I set weights for all the fields in both the data and reference columns?
A: The weights are calculated automatically by the QS tool once you fill in the proper values. The weights for all fields (the composite weight), for both the data and reference files, are calculated based on the m and u probabilities, frequency information and other factors.
Which match type do I need to use in the Match specification?
A: In your case, for multi-word comparison I would use MULTI_UNCERT, MULTI_ALIGN, or UNCERT.
How can we set m_prob and u_prob for a particular match command? Is that just guesswork? Is it the same with the cutoffs?
A: It is not guesswork.
The m probability comes from your requirements: the business rules should tell you what will be considered a good match, and with what probability. The u probability is easier: fill in any value and the tool will take care of using the proper value ...
If they don't know, try m = .9 (90%).
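For intuition on how m and u turn into weights, here is a sketch of the standard probabilistic record linkage (Fellegi-Sunter) formulas: a field that agrees contributes log2(m/u), and a field that disagrees contributes log2((1-m)/(1-u)). This is an assumption about the general method, not a copy of the tool's internals, which may also adjust u from frequency data and scale weights for uncertainty comparisons. The u values below are made up.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Weight added when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

def composite_weight(fields) -> float:
    """Sum the per-field weights; `fields` holds (agrees, m, u) tuples."""
    return sum(
        agreement_weight(m, u) if agrees else disagreement_weight(m, u)
        for agrees, m, u in fields
    )

# Hypothetical pair: the first field agrees, the second disagrees.
pair = [(True, 0.9, 0.01), (False, 0.9, 0.1)]
print(round(agreement_weight(0.9, 0.01), 2))
print(round(disagreement_weight(0.9, 0.1), 2))
print(round(composite_weight(pair), 2))
```

The higher m is relative to u, the more an agreement on that field pushes the composite weight up, which is why m should reflect how reliable the field really is.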
The cutoff values, on the other hand, should be set for each pass. The first time you run the Test Pass, it is a good idea to set them all to zero just to get the histogram of the pass. The histogram will show where the pairs become matches, duplicates and residuals. Then pick the weights from the graph, replace the cutoff values ... and run the Test Pass again ... until all data records match properly with the reference records ...
The IBM QualityStage user guide explains this topic in detail.
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses
Hi JRodriguez,
Thanks for the great support and for your patience in answering my questions. It is working for me now. I had just used the wrong blocking criteria.
I have changed it now.
Thanks,
jyothi