address cleansing

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
knowledge
Participant
Posts: 101
Joined: Mon Oct 17, 2005 8:14 am

address cleansing

Post by knowledge »

Hi ,

I have to clease and match address records, one source has ST and ST. and other has STREET , many more like this(like AVE. CIRC.). i have to write routine or or unix script to cleanse this data and spell out so that I can match 2 records coming from different sources ,


i have simple routine :
LINE1_TEXT =Arg1

if (LINE1_TEXT='RD' or LINE1_TEXT='RD.') then Ans='ROAD'
else if LINE1_TEXT='ST' or LINE1_TEXT='ST.' then Ans='STREET'

else Ans=LINE1_TEXT


But when I am passing line 1 of address as an argument, it is not taking the whole string , as it just identifies 'ST' ot 'ST.' as argument .

Is there any function which will take whole line1 column as argument and changes all the words like ST. or LN or N. to STREET or LANE or NORTH,

I looked into ereplace or change function but not sure how to use it in routine ,

Please suggest .

Thanks .
gateleys
Premium Member
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Re: address cleansing

Post by gateleys »

You can't directly apply an Ereplace or similar functions on the entire Address1 string because it would change "123 STRAIT ST." to "123 STREETRAIT STREET".

This kind of work really requires specialized tools like QualityStage. However, for simplistic solutions, analyze your data. If variations for STREET, like ST or ST. or STR, and AVE or AVE. for AVENUE, etc only appear at the end of the first line of address, then you could use the following code -

Code: Select all

vAddress1 = Trim(UpCase(InputLink.Address1))
*Get the number spaces in Address1
vSpaceCnt = Count(vAddress1)
*Process only those addresses with more than 1 words, so BRAVE does not transform to "BRAVENUE"
If vSpaceCnt > 0 Then  
   *Check Last word
   vLastWord = Field(vAddress1, " ", vSpaceCnt + 1)
   If  vLastWord = "ST" or vLastWord = "ST." or vLastWord = "STR" or vLastWord = "STR." Then "STREET"
   If  vLastWord = "AVE" or vLastWord = "AVE." Then "AVENUE"
   *You can perform other replacements here
End 
gateleys
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Use QualityStage - this is precisely the kind of thing (OK, one of the things) it's designed to do. It even correctly figures out that 125 ST STEVEN ST should be 125 SAINT STEVEN STREET.

The technique is called standardization. A QualityStage standardization task can be included in a DataStage job flow.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

The amount of time you need to account for all of the variations (ST, St, st, Street, STREET, et cetera) is better done by using QualityStage.

In fact, as part of the 8.x suite, you will have QualityStage within the whole framework, making it so valuable and easier to use.

The alternative solution would require custom coding that would be expensive if you hire a consultant to do so.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

A couple of small but necessary corrections to Teej's post.

Even in version 8.x QualityStage remains separately licensed. That is, you can run DataStage without QualityStage - the data quality stage types remain disabled.

Second, if you are a server-only shop, you do NOT get all the new goodies. You still have to rely on the QualityStage plug-in for server jobs to invoke separately designed version 7.5 style QualityStage tasks.
The new goodies are available only in the parallel execution environment.
(I felt that was a pertinent comment since we're in the Server forum here.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
knowledge
Participant
Posts: 101
Joined: Mon Oct 17, 2005 8:14 am

Post by knowledge »

Hi All,

Thanks for the replay ,

My manager doesn't want to use quality stage (dont know why) ,
I have to write unix script which will take care of all cleasing ,

I wanted to do it in datastage but it seems it is very difficult to do it in datatstage.


thanks for all ur input
gateleys
Premium Member
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

knowledge wrote:My manager doesn't want to use quality stage (dont know why)
That's probably because your site does not have QualityStage and is not willing to make the investment.
knowledge wrote: I wanted to do it in datastage but it seems it is very difficult to do it in datatstage.
As I and others have pointed out, this is the domain of tools like QualityStage. However, if your address parsing and correction requirements are not very exhaustive, then you can use DataStage. Did you try to implement the code that I posted above?
gateleys
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Who's going to maintain this code into the future as the requirements increase - particularly after you've moved on? Can you guarantee to meet key performance indicators (whatever "performance" is) using a hand-coded approach? What testing regime exists? Do you know what all the possible values to be cleansed are? What would you do with "CR"?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply