Cleansing Name

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Cleansing Name

Post by chandra.shekhar@tcs.com »

Hi All,
We have received a requirement from our client where we have to cleanse the Customer Name Field through Datastage.
The Name(VARCHAR(120)) should be considered as Invalid and to be reported in the target if

1) It contains only numbers OR
2) It contains only junk values(or special characters) OR
3) It is null or Empty

The Customer Name is valid if it contains alphanumeric or alphabets or alphanumeric with junk values.
E.g.

Code: Select all

Datastage     -- Valid
Datastage 8.5 -- Valid
8.5           -- Invalid
              -- Invalid
Datasta%^ge   -- Valid
1234          -- Invalid
^&*(          -- Invalid
I have tried using Alpha & Alnum function but getting stuck afterwards.
Thanx and Regards,
ETL User
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

OK... I'll start. "Junk" values? :?

Regardless, you should be able to use the double convert technique to remove all of the good characters and if anything is left then it is "invalid".
-craig

"You can never have too many knives" -- Logan Nine Fingers
wwilliamson
Participant
Posts: 21
Joined: Fri Oct 01, 2010 2:45 pm
Contact:

Post by wwilliamson »

Based on the rules you've given, if there's even a single letter present then the name is valid. If that's the case then I'd do what chulett mentioned, but remove non-alphabetic characters, and if the length is still > 0 then it's a valid name. Also, it might be a good idea to run NullToEmpty() and Trim() on your column data.
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

You may need double convert however I would suggest calculating length of original string and the string after convert (all alphabets)and if the length is same then reject it.

So your derivation should look like

Stage Variable

svOrignalLen=Len(InLink.Name)
svConvertLen=Len(convert("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz","",InLink.Name))

Constraint

svOrignalLen > svConvertLen


The second option using double convert

Len(convert(convert("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz","",InLink.Name),"",InLink.Name))<>0

so this will remove all the junk characters and leave you with alphabets only. So if there is any alphabet, your string is valid

May be this is what Craig meant.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Rakesh311
Participant
Posts: 36
Joined: Wed Aug 04, 2010 10:53 pm
Location: Banglore
Contact:

Post by Rakesh311 »

For this requirement i feel its better to go with Quality stage if u have.

Write a pattern like *? and *^
*? Which will satisfy only if your input contain Alphabets.
*^ Which will satisfy only if your input contain Number.

Any by using this ruleset in Standardize stage you can achieve output.

Regards
rAKESH
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Rakesh,
I know that Qualitystage is better tool for this requirement, but we have a constraint of doing it by Datastage only.
Thanx and Regards,
ETL User
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Silly requirement. Use the best tool for the job.

Perhaps you can do this in steps. Use Convert() to remove all of the digits and punctuation then check the size of the result. If it is zero then it was 'only numbers' and is invalid. Then from that result upcase it and use Convert() to remove A-Z and any other valid characters from it. If the size doesn't change it would fall into your 'only junk or special characters' category and is invalid. Otherwise... valid.

Off the top of my head, that should get you pretty close.
-craig

"You can never have too many knives" -- Logan Nine Fingers
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

chandra.shekhar@tcs.com wrote:Also I need to trim, do a nullablity check & remove tab/new line characters from the name too.
In 8.5 and above versions you may skip null checks but you need to do trimming and other operations in addition to this as per your requirement.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Priya
I'll try it and let you know whats my result.
For the time being I'll tell you what I have done.
I have used 1 stage variable

Code: Select all

svName --> Convert(Convert('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz','',NullToEmpty(DSLink2.NAME)),'',NullToEmpty(DSLink2.NAME))
Then I have used a constraint where I filter the data on the basis of

Code: Select all

(AlNum(svName) = 0 AND Alpha(svName) = 0)
(Both should be false)
Is this correct ?
In 8.5 and above versions you may skip null checks
Can you elaborate, I am using version 8.7..


@Craig,
Not able to view your whole content, :(
Can you please summarize a bit ?
Thanx and Regards,
ETL User
Rakesh311
Participant
Posts: 36
Joined: Wed Aug 04, 2010 10:53 pm
Location: Banglore
Contact:

Post by Rakesh311 »

Okay
Lets take it in a different way,
so your requirement is if any alphabet present in your input column then it is valid.


lets do in 3 steps:
1)convert $ in the input empty.

2) then Covert all alphabets present in the output of 1st step to $,

one suggestion instead of writing all alphabet in both cases , covert input to one any of the case and use that case in your code.

3) then if count of $ in output of step 2 is greater than 0 means you have some alphabet in your input.

Regards,
rAKESH
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

I see you have marked the topic as workaround. Does this mean the solution worked?
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Priya
Yes, my solution what I have posted above and your solution are giving me the same output so I think the logic is working now.
Can you explain me when you said that we can skip null checks in 8.5 and above ?
How can we do it ?
Thanx and Regards,
ETL User
jerome_rajan
Premium Member
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

priyadarshikunal wrote:In 8.5 and above versions you may skip null checks
That's Incorrect if you mean what I think you mean :?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
Post Reply