String Tokenizer

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

Is there any function in BASIC similar to StringTokenizer in Java or strtok in C/C++?

I'm trying to parse a list of text values (like a regular-language sentence) and replace every non-alphanumeric character with a single replacement character (one per occurrence).

Ereplace() is very specific. Convert() requires me to supply all the special characters in advance, along with as many replacement characters. I'm looking for something simpler, if such a function exists.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

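Why not use Convert() twice, with something similar to this:

convert(convert('1234567890','',col1),str(<replacement char/s>,len(convert('1234567890','',col1))),col1)

Note that '1234567890' is the list of characters to keep, so add the letters to it if they should survive as well. Sorry, I can't validate the syntax at the moment. It's been a long time since I used it.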
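
The nested call can be read as two separate steps. Here is a minimal sketch in DataStage BASIC, assuming the replacement is a single space; the names Keep, Noise and Result and the sample value of col1 are illustrative only and do not come from the post.

```basic
* Sketch only: Keep, Noise, Result and the sample text are illustrative.
Keep = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
col1 = "Late fee, waived (2nd time)!"

* Step 1: delete every character we want to keep, which leaves
* only the characters that must be replaced.
Noise = Convert(Keep, "", col1)

* Step 2: map each of those characters to a space. Str() builds a
* replacement list of the same length, so nothing is deleted.
Result = Convert(Noise, Str(" ", Len(Noise)), col1)
```

Because the list of characters to replace is derived from the data itself, there is no need to enumerate the special characters in advance. Wrapping Result in Trim() would also collapse runs of spaces into one, if that is wanted.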
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There's no function out of the box, but it would be very easy to create one.
Assuming that the string is already space-separated there is no real need to tokenise - if you think there is, please provide a more exact specification.

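FUNCTION ReplaceNonAlphaNumerics(aString,aReplaceChar)

AlphaNumerics = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
REM Add lower case alphabetic characters to string if required.

REM Collect every character of aString that is not in AlphaNumerics.
NonAlphaNumerics = Convert(AlphaNumerics, "", aString)

REM Map each of those characters to aReplaceChar (one per occurrence).
Ans = Convert(NonAlphaNumerics, Str(aReplaceChar, Len(NonAlphaNumerics)), aString)

RETURN(Ans)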
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rameshrr3

Post by rameshrr3 »

Thanks for validating my fears. I'm going to continue using Convert().

Effectively, we are tokenizing a regular-language sentence into a set of delimited 'words' and pivoting them, looking them up against one table to eliminate 'useless' words and text noise (prepositions, articles, etc.), and scanning the significant words against another keyword lookup table (which can keep growing) to do an English keyword search. Each sentence comes from a column called "reason desc", which stores free-form text. For each reason id, we assign a weight to each matched keyword, sum the weights per reason id, and write back a total weight score that indicates how meaningful the sentence is likely to be for feeding to a text analytics engine.
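
The scoring logic described above could be sketched in DataStage BASIC roughly as follows. This is only an illustration: StopWords, Keywords and Weights are hypothetical in-memory stand-ins for the two lookup tables, and the sample words and weights are made up. In the real job the pivot and the lookups would be done by stages rather than by a routine.

```basic
* Sketch only: the three lists below are hypothetical stand-ins
* for the noise-word table and the keyword/weight table.
StopWords = "THE" : @FM : "A" : @FM : "OF" : @FM : "FOR"
Keywords  = "REFUND" : @FM : "DAMAGED"
Weights   = 5 : @FM : 3

* Upper-case the text and collapse repeated spaces before splitting.
Sentence = Trim(Upcase("Refund for the  damaged item"))
Score = 0

For i = 1 To DCount(Sentence, " ")
   Word = Field(Sentence, " ", i)
   * Only words that are not noise words are checked against the keywords.
   Locate Word In StopWords Setting Pos Else
      Locate Word In Keywords Setting Pos Then
         Score = Score + Weights<Pos>
      End
   End
Next i
* Score now holds the summed weight of the keywords found in the sentence.
```

With these made-up lists, REFUND and DAMAGED are the only words that contribute, so the sentence scores the sum of their two weights.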
priyadarshikunal

Post by priyadarshikunal »

If you really don't know which characters will appear in what you called noise, you can use Convert() as suggested above.
ray.wurlod

Post by ray.wurlod »

Once you've effected the conversion, change the space characters to Char(10), write to a text file with no formatting, and read back from the text file with the line terminator specified, so that each word arrives as its own row.
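
The space-to-linefeed step might look like the following in DataStage BASIC. It is a sketch under the assumption that the cleaned text is already space-separated; Cleaned and OneWordPerLine are illustrative names.

```basic
* Sketch only: Cleaned and OneWordPerLine are illustrative names.
Cleaned = "Late fee  waived  2nd time"

* Trim() collapses runs of spaces first, so no empty lines are produced;
* Convert() then turns each remaining space into a line feed.
OneWordPerLine = Convert(" ", Char(10), Trim(Cleaned))
```

Written to a sequential file without quoting or padding, this value yields one word per line when the file is read back.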