Custom rule sets

Infosphere's Quality Product

Moderators: chulett, rschirm

kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Custom rule sets

Post by kennyapril »

I need to do a batch job using custom rule sets, and I am new to QualityStage.

The rule sets that are in the tool right now are the out-of-the-box ones, so I have to create new custom rule sets.

Please suggest how to create new custom rule sets.
Regards,
Kenny
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In what way "custom"? If you're interested in adapting one of the existing out-of-the-box domain-specific rule sets, you may be able to get away with using overrides. Next easiest is copying the rule set and modifying, say, the classification table. Modifying the pattern-action language requires some skills that you will have to acquire. Least easy is creating an entire rule set (classification, dictionary, pattern-action language, lookup tables, descriptor) from scratch.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Warning: the following post may cause drowsiness. Do not operate heavy machinery after reading this...


First thing: what do you need this job to do? Writing a custom ruleset takes a bit of time. Do you actually need one?

Next: is the data about name, address, locality, email, phone or tax id (US only for the last 2)? If yes, then you can use a standard one and then use overrides as Ray said to fill the gaps.

If it's not (eg: product descriptions like raw materials or cars, etc), then you'll need to write your own.

Do you understand the rules for this domain? Then you will know what sort of terms you will need to recognise or group and what output fields you will need.


I'm sure there are a few ways to skin this cat, but here's how I'd do it (get the QS Users Guide and Pattern Action Language Reference at this point).

1: Make a copy of a ruleset. If one of the standard ones is kind of close, then use it to save time. Rename it to something decent, and keep it to 8 chars or less.

2: If it's totally different, empty out the CLS file. Keep the first line and keep the lines starting with a ';' (the comments). They help you to see the structure and prompt you to self-document.

3: Create a job to do a word investigate on the data you're looking to process, using the new ruleset copy you created. Make sure you have selected the Token Frequency report: it's the important one at this stage.

4: Click on Advanced. Check all of the options related to unknown words, numbers, punctuation, complex terms, etc. Also select the option to return the original spelling (very important, that one). It will produce a lot of extra results, but you want to see it. Clear out all of the characters in the Strip list and just put a space there. In the Separation list, put any other punctuation they might have left out.

5: Run the job and check the results. Ignore the punctuation and move it to the side for the time being. Look at the words and group them by meaning. For cars, you can pick out the colours, make, model, etc, work out a standard way of spelling or abbreviating each term (the standard form) and give them a letter (the classification type). These become the entries in your CLS file.
For things like sizes or volumes, just take a note of them and try to understand what they mean and where they fit, so you can process them later in the PAT file.

6: Take all of the punctuation and see how it is used in the real data. Understanding this will tell you what needs to go into the Separation and Strip lists. In short, Sep list decides how to split up the text into tokens. The Strip list gets rid of stuff that will make your rules more complicated. It's a balancing act between keeping the characters that help you to understand what the data means, and removing enough so that you don't have 1000 rules for every dumb comma that someone put in.

7: Put all of the terms you identified into the CLS file. Order them by the classification type, standard form, and the original value. If they are long words that are easy to misspell, use a comparison value to allow some fuzzy matching in the term recognition (Check the User Guide out for this).
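
To make step 7 a bit more concrete, here's a rough sketch of what classification entries tend to look like, using a made-up car domain (keep whatever header lines your copied rule set already has, as per step 2). The layout is token, standard form, class letter, and an optional comparison threshold for fuzzy matching; the tokens, class letters and the 800 threshold below are all invented for illustration, so check the User Guide for the exact format your version expects:

    ; ---- colours (class C) ----
    RED          RED          C
    CRIMSON      RED          C
    ; ---- makes (class M), with a comparison threshold for names that are easy to misspell ----
    TOYOTA       TOYOTA       M  800
    MITSUBISHI   MITSUBISHI   M  800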

8: Put your updated Separation and Strip lists into the PAT file at the appropriate lines near the top.
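
As a sketch of step 8 (the directive names and their exact placement vary between versions, so mirror whatever form the rule set you copied already uses rather than typing this verbatim), the two lists are just quoted character lists near the top of the PAT file. Both normally begin with a space; the hyphen and slash are deliberately left out of the strip list here on the assumption that step 6 decided they carry meaning:

    SEPLIST " ~!@#$%^&*()_-+={}[]:;<>,./"
    STRIPLIST " ~!@#$%^&*()_+={}[]:;<>,."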

Save the ruleset and provision all.

9: In your investigate job, add the pattern report if you removed it the first time. Reset the Sep and Strip lists to the defaults (ie the ones that are now in the PAT file). Now you will see the words get classified and the patterns become more usable.

Do steps 5-9 until you're happy with the results.
Note: you don't need to classify every single word there. You just need to pick out the ones that provide the framework. eg: For address, you classify directions, street types, unit types, etc, but don't need to classify the street names, because it's not feasible. Having a number followed by an unknown word followed by a street type (ie: ^ + T ) tells you everything you need to know to make sense of the unknown word. You just need to understand the rules.
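
To give a feel for what that ^ + T example might look like once you get to the PAT file (step 12), a single pattern-action rule can carve up that whole family of inputs. This is only a sketch: the dictionary field names in braces are hypothetical (use whatever you define in your DCT), COPY is meant to keep the token as it appears while COPY_A writes the standard form from the classification table, and you should confirm the COPY / COPY_A / EXIT actions and the pattern syntax against the Pattern Action Language Reference:

    ; number, unknown word, street type (e.g. "123 MAIN ST")
    ^ | + | T
        COPY [1] {HouseNumber}
        COPY [2] {StreetName}
        COPY_A [3] {StreetType}
        EXIT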

10: Fill out the DCT file with the fields you want to output, making sure they are long enough to handle the results. You can do this earlier, but it isn't necessary until now.
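
If it helps to picture step 10, dictionary entries are one output field per line: a field identifier, a type, a length, and a description (plus version-specific columns such as a missing-value code). The field names and widths below are invented; the DCT from the rule set you cloned is the authoritative template for the exact column layout:

    HN  C  10  S  HouseNumber
    SN  C  30  S  StreetName
    ST  C  10  S  StreetType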

11: Take a ruleset like USADDR and understand how it uses rules to process the patterns.
Slashes, Hyphens, etc. take the punctuation that may add value and process it for specific purposes, and then you remove the rest (the punctuation that doesn't add value). That lets you get the context out while making things easier later.
Common_Patterns knocks out the majority that fall into a couple of basic rules.
Other subroutines process certain parts of the data and remove that bit, leaving a simpler pattern to work with further. eg: for USADDR, Levels and Units do specific processing and leave a simpler pattern for later rules to sort out.
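
A minimal sketch of that "process a bit and remove it" idea, using an invented unit-type class U and hypothetical dictionary fields; the key move is RETYPE-ing the handled tokens to the NULL class (0) so the later, simpler rules never see them. Confirm RETYPE and the NULL class against the PAL Reference before leaning on this:

    ; "UNIT 5" style pair: copy it out, then drop both tokens from the pattern
    U | ^
        COPY_A [1] {UnitType}
        COPY [2] {UnitNumber}
        RETYPE [1] 0
        RETYPE [2] 0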

12: Add your own rules in the PAT file. This is where the PAL Reference Guide is your bible.
I tend to remove the existing text and replace it in chunks, to keep the overall structure.
Handle rules as per general points in step 11.
If you have specific patterns that don't behave the same as the "normal" ones, handle them early to avoid having them parsed incorrectly by the "default" behaviour.

13: Create a STAN job using your ruleset and look at how it parses (or doesn't parse) your data. What does it do right? Wrong? What doesn't it do at all?
Look at the InputPattern and UnhandledPattern, along with the UnhandledData. Check the ones that it does fully parse, to make sure it's doing them properly.
You will start to see the gaps. Maybe you missed a term or it has a couple of meanings. You can fix that in the CLS file or the PAT file (check Multiple_Semantics in AUADDR). Maybe there is a specific part of an InputPattern that is causing problems. You could do a subroutine for it, like Units.

Rinse, repeat, etc.


Um, that's about it, I think. Can't think of anything else at the moment.

Cheers,
Stuart.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Re: Custom rule sets

Post by ray.wurlod »

kennyapril wrote:I need to do a batch job using custom rule sets...
Now, after my short response and Stuart's long one, are you SURE you want/need to build a custom rule set? As noted, you may be able to get away with overrides if you're in a name/address/area domain.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Thanks for the information!!


In a project, for matching purposes, we are using validity with some rule sets in it.

I have to create the same rule sets, based on the rule sets in validity, so that they can improve the match weight.

So my task is to develop a job and service that uses the CASS server module, and to augment the match process beyond what was initially implemented, improving the match rate with custom-defined rule sets.

Please advise based on the above requirement!
Regards,
Kenny
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

kennyapril wrote:Thanks for the information!!


In a project, for matching purposes, we are using validity with some rule sets in it.

I have to create the same rule sets, based on the rule sets in validity, so that they can improve the match weight.

So my task is to develop a job and service that uses the CASS server module, and to augment the match process beyond what was initially implemented, improving the match rate with custom-defined rule sets.

Please advise based on the above requirement!
Originally I saw CASS and thought: start with USADDR and use overrides for the gaps.
Now I just have more questions.

Are you trying to improve the number of CASS-validated addresses you have, or are you trying to use CASS (and now some extra processing with a custom ruleset) to massage inputs into an existing match process you have?

Forget custom rulesets, validity, etc for a second.
What is the problem you are trying to solve?
Can you spell out what data you have, what you're trying to match it to, how it currently does it (at a high level obviously) and what the point of this new job is?
We'll have a better chance of helping if we understand the aim of this new job, not just answer the questions you are actually asking.

Cheers,
Stuart.
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

The first thing to do is to create a service whose request is an address. I have to use CASS for the accuracy of the address, and the response from that service would be the standardized addresses from CASS.

It goes like this... the address provided would be...

1 MICROWSOFT
REDMUND WA


and I have to send all the standardized addresses which match the above address as the response.

Please suggest the way to do it, or the stages to use.

Thanks
Regards,
Kenny
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

Kennyapril,

Reading through your post, it looks like you just need a QS job with a CASS standardize stage exposed as a service using ISD.


CASS doesn't return all addresses from the USPS reference database that match the address provided. It just returns the input address with the enhancements from the process when a match is found in the reference database.

Also be aware that when using CASS in an ISD job you must set the maximum runtime parameter to 10000 for the job. This allows the job to restart every 2 1/2 hours and ensures that the job retrieves the most current CASS data that is installed. (CASS data expires every 105 days.)

The good news is that an ISD job can use expired data (this is the only way to reuse an expired database) :D
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Thanks,


OK, I need the CASS stage to correct the provided address and then standardize the corrected address, because the response from the service has to be all the standardized addresses derived from the corrected address.

When CASS data expires every 105 days, does it automatically load the new data after 105 days?

The runtime parameter should be set to 10000 for the job. Is that set in the ISD application?
Regards,
Kenny
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

Kennyapril,

I'm not quite sure what you mean by the below:

"OK, I need CASS stage to correct the provided address and then a standardize the address which is corrected because the response form the service has to be all the standardized addresses from the corrected address."

When CASS data expires, you should download the new database from the IBM website and install it. Due to the way ISD jobs work when using CASS, the database is not refreshed; it is read just once and from that point on is static. Having the runtime parameter set ensures that the ISD job shuts down and restarts at certain intervals, so the database gets refreshed.

You can read more about this in Eostic's post in the IBM SOA Edition forum

Yes, you need to set the runtime parameter in ISD.
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Sure, I will look at eostic's post.

What I meant was that the response from the service is not only to correct the address but also to standardize it, so I would use two stages: CASS and Standardize.

If I set the runtime parameter, it would refresh by itself. Is that right?


Thanks
Regards,
Kenny
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

As part of the CASS process the data will be standardized. CASS is just a very specialized standardization stage

It is a good idea to send the input address in the best shape that you can to the CASS process; you will increase the matching possibilities against the reference database. But if you will be exposing this process as a service, then that will increase the latency/response time. CASS is a very intensive process.

Also, if I recall correctly, there was an issue having a STD stage and CASS together in the same job...
Last edited by JRodriguez on Fri Nov 05, 2010 9:36 am, edited 1 time in total.
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Thanks,


Actually, when an address is entered as a request, the address should be corrected and then standardized.

Once it is standardized, the address splits into some forms of addresses.

There is no need to match against the database in this service; the only thing is to correct the address, standardize it, and send whatever you get after standardizing.

So please suggest the stages to use for the above output.
Regards,
Kenny
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

kennyapril wrote:Thanks,


Actually, when an address is entered as a request, the address should be corrected and then standardized.

Once it is standardized, the address splits into some forms of addresses.

There is no need to match against the database in this service; the only thing is to correct the address, standardize it, and send whatever you get after standardizing.

So please suggest the stages to use for the above output.
I think Julio has already answered this.

Is there something you don't like about the CASS output fields? Doesn't match an interface spec you've been given?

If your spec is more granular than what CASS gives you, you could take the output street fields from CASS and put them through USADDR to get the most granular level. I can't imagine there will be an issue with the locality fields, but if so, put them through USAREA.

If that is still different from your output interface spec, then use a Transformer stage to push the USADDR output fields together so that they fit.

Have you created a basic job to see what CASS alone gives you?
kennyapril
Participant
Posts: 248
Joined: Fri Jul 30, 2010 9:04 am

Post by kennyapril »

Thanks for the information.

I designed a sample job with a source which has all the addresses and used a CASS stage to a target sequential file; I also used the reference database from the opt/IBM/cass path.
I got an error which says

"QualityStage Compressed USPS data files are older than 105 days"

"New files must be obtained (USPS requires that no processing be allowed).
error opening files for reader"

I did a search but did not find these words.

Please suggest.
Regards,
Kenny