Name Standardization - How it works?

sigma · Post by **sigma** » Mon Oct 06, 2008 9:31 am

Hi All

We are in an AIX environment, using DataStage and QualityStage.

We are learning to use standardization within QualityStage and have a lot of questions. We are hoping to get some help from this forum.

Thanks in advance for your help.

We are using a much simpler example here, because everything we tried with our actual data set did not yield what we expected, and we are not sure we used the stage correctly either.

Our data file has one column (Full Name):

JOHN ADAM
JOHN COLLINS
MARY KATE
WORKER JOHN SMITH
YOUTH ROBERT JONES

We are using the USNAME ruleset

Out of the box, the stage gave the following patterns for the above:

FF
F+
FF
WF+
WF+
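As a rough illustration of how a classification table can turn tokens into pattern strings like the ones listed above, here is a toy Python model. It is not how QualityStage is implemented; the table contents and the names `CLASSIFICATION` and `pattern_for` are assumptions made up for this example.

```python
# Toy classification table (hypothetical entries, not the real USNAME CLS file).
# F = known first name, W = word such as WORKER or YOUTH.
CLASSIFICATION = {
    "JOHN": "F", "ADAM": "F", "MARY": "F", "KATE": "F", "ROBERT": "F",
    "WORKER": "W", "YOUTH": "W",
}

def pattern_for(name):
    """Return one class letter per token; unlisted words become '+'."""
    tokens = name.upper().split()
    # In this toy model every token missing from the table is an unknown word.
    return "".join(CLASSIFICATION.get(token, "+") for token in tokens)

if __name__ == "__main__":
    for name in ["JOHN ADAM", "JOHN COLLINS", "MARY KATE",
                 "WORKER JOHN SMITH", "YOUTH ROBERT JONES"]:
        print(name, "->", pattern_for(name))
```

With this made-up table, ADAM and KATE classify as first names, which is why those rows come out as FF rather than F+.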

Our first question is: why does the Standardize stage not split the name into a last name field?

The output does not have a LastName field; it has first name, middle name, and primary name.

The classification file also lists L for last name at the top, but there are no actual data lines marked L. In other words, the classification file has lots of first names specified but no last names. Is it typical for us to add those last names? In that case, it does allow us to write to it, so is it normal to make copies of the ruleset and then edit them?

Or are we missing something very basic?

I would have thought that the above names, except for the last two, are pretty standard US names, and that I would have gotten at least an L pattern in the output.

Please let us know if we are missing something very basic

Our objective really is to override so that all W patterns are dropped.

So in our example if we pass YOUTH JOHN SMITH, it should drop the word YOUTH.

Thanks

ray.wurlod · Post by **ray.wurlod** » Mon Oct 06, 2008 3:36 pm

USNAME out of the box expects a comma in lastname, firstname. You can amend the rule set using various techniques - to drop the W token I would suggest an input pattern override.

sigma · Post by **sigma** » Mon Oct 06, 2008 3:52 pm

So if, out of the box, it expects lastname, firstname, then I will have to format my data to conform to the lastname-comma-firstname layout. That means we have to know how to parse and identify the last name and first name, which defeats the purpose of the tool if we have to know that much detail about the data.

We were hoping that the tool would standardize at least US names out of the box into first name and last name. The reason we want to use the tool is that we do not want to parse the data ourselves to find the first name and last name, format the data, and then pass it to the Standardize stage.

Having said all this, what are the ways to modify the ruleset if we do not have overrides? With overrides I can see where it allows us to make changes, but otherwise, by default, it says the files are read-only. Do we copy the ruleset and then modify the copied ruleset?

ray.wurlod · Post by **ray.wurlod** » Mon Oct 06, 2008 5:23 pm

You DO have overrides. They are accessed from the Rules menu.

There are limits. For example, how should you parse FF (for example ELTON JOHN)? The default out of the box is {FN} {LN}.

Run the Rules Analyzer over WORKER JOHN SMITH (or anything else that generates WF+ pattern). What fields do the data get parsed into? Why is that not acceptable? Where it's not, that's where you use overrides.

Use the Rules Analyzer to get a better feel for how the parser is working with the out-of-the-box rule set. If you don't like what comes out of the box then adapt, either with overrides or by creating an alternative Rule Set using USNAME as your prototype.

You don't need to format the data - that's what Standardize stage does.

stuartjvnorton · Post by **stuartjvnorton** » Mon Oct 06, 2008 7:13 pm

sigma wrote: So if, out of the box, it expects lastname, firstname, then I will have to format my data to conform to the lastname-comma-firstname layout. That means we have to know how to parse and identify the last name and first name, which defeats the purpose of the tool if we have to know that much detail about the data.

We were hoping that the tool would standardize at least US names out of the box into first name and last name. The reason we want to use the tool is that we do not want to parse the data ourselves to find the first name and last name, format the data, and then pass it to the Standardize stage.

Having said all this, what are the ways to modify the ruleset if we do not have overrides? With overrides I can see where it allows us to make changes, but otherwise, by default, it says the files are read-only. Do we copy the ruleset and then modify the copied ruleset?

To start, there is no specific LastName field, because this ruleset also handles organisational names. The complete organisational name goes in the PrimaryName field.
Individual names use PrimaryName field to hold the last name.

There are no last names specified in the classification file because USNAME doesn't try to come up with a list of last names. There are far too many to even attempt to nail down like that.

What it does do is attempt to work out the last name using a process of elimination and by using the other words as a guide.
The L type you are talking about is for last name PREFIXES, e.g. Van, Der, San, De, etc. These words suggest that the word[s] that follow them form the last name.

eg:
F+ -> FirstName(F) LastName(+)
+,F -> LastName(+) , FirstName(F)
FL+ -> FirstName(F) LastName (L+)
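To make the three mappings above concrete, here is a small Python sketch of a pattern-to-field lookup. It is only a model of the idea, not the USNAME pattern-action file; the `RULES` table and the `assign_fields` function are hypothetical names, and the last name goes into PrimaryName as described earlier in this post.

```python
# Hypothetical rule table: one target field per token in the pattern.
# None means the token (here the comma) is not copied to any field.
RULES = {
    "F+":  ["FirstName", "PrimaryName"],
    "+,F": ["PrimaryName", None, "FirstName"],
    "FL+": ["FirstName", "PrimaryName", "PrimaryName"],
}

def assign_fields(pattern, tokens):
    """Copy each token to the field its pattern position maps to."""
    targets = RULES.get(pattern)
    if targets is None or len(targets) != len(tokens):
        return None  # no rule in this toy table for the pattern
    fields = {}
    for token, field in zip(tokens, targets):
        if field is None:
            continue
        # Tokens sent to the same field are joined with a space (e.g. L+).
        fields[field] = (fields.get(field, "") + " " + token).strip()
    return fields

if __name__ == "__main__":
    print(assign_fields("F+", ["JOHN", "COLLINS"]))
    print(assign_fields("+,F", ["COLLINS", ",", "JOHN"]))
    print(assign_fields("FL+", ["LUDWIG", "VAN", "BEETHOVEN"]))
```

In the FL+ case the prefix and the word after it both land in PrimaryName, giving "VAN BEETHOVEN" as the last name.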

Out of the box it does handle a lot more than just +,F so you can just run it straight through and it should do fine for most things.
From your example it looks like you're using it on a list of individuals. If so, use the Process as Individual option when you hook up the STAN step to force it to treat all data as a person. That may also make it work a bit better for your needs.

As for taking YOUTH and WORKER out of the classification file, it's probably not a good idea. What if there is an organisation name with either of those words in it? Maybe not in this situation, but in others.
It's a bit risky unless you're sure that the words are wrong in all situations and removing them is OK, or unless you change the PAT file so that it drops those words in very specific circumstances and behaves normally otherwise.
Using an input pattern override is the easiest, as Ray says.
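Conceptually, an input pattern override says "when the whole input has exactly this pattern, send the tokens here instead of letting the normal rules decide". The Python sketch below models only that idea; it is not QualityStage's override format, and `OVERRIDES` and `apply_override` are hypothetical names invented for the illustration.

```python
# Hypothetical override table keyed by the full input pattern.
# None in a position means "drop this token".
OVERRIDES = {
    "WF+": [None, "FirstName", "PrimaryName"],
}

def apply_override(pattern, tokens):
    """Return the overridden fields, or None if no override matches."""
    targets = OVERRIDES.get(pattern)
    if targets is None or len(targets) != len(tokens):
        return None  # fall through to the normal rules (not modelled here)
    return {field: token
            for token, field in zip(tokens, targets)
            if field is not None}

if __name__ == "__main__":
    # The W token (YOUTH) is dropped; the rest is kept.
    print(apply_override("WF+", ["YOUTH", "JOHN", "SMITH"]))
    # A pattern with no override is left for the default processing.
    print(apply_override("F+", ["JOHN", "COLLINS"]))
```

Because the override is keyed on the exact pattern WF+, words like YOUTH or WORKER appearing in other patterns (an organisation name, say) are untouched, which is the reason it is less risky than editing the classification file.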

That said, if you want to actually change the ruleset beyond overrides, you are right: you will need to copy and rename the ruleset to be able to change CLS or PAT files.
Note that the length of the name must still be 8 characters or less. The designer might let you do otherwise and it works in the tester window, but it won't work in a job.

Hope this helps.

sigma · Post by **sigma** » Wed Oct 08, 2008 3:33 pm

Thanks a lot for the suggestions.

I am a little behind but will probably digest all this to try out a few scenarios later this week

I will post back what I find out

Thanks again for the explanation it helps

Regards
Arvind