String Tokenizer

rameshrr3 · Post by **rameshrr3** » Thu Mar 07, 2013 3:03 pm

Is there any function in BASIC similar to StringTokenizer in Java or strtok in C/C++?

I'm trying to parse a list of text values (like a regular natural-language sentence) and replace every non-alphanumeric character with a single non-alphanumeric character (one per occurrence).

Ereplace() is very specific. Convert() requires me to supply all the special characters in advance and to specify as many replacement characters. I'm looking for something simpler, if such a function exists.

priyadarshikunal · Post by **priyadarshikunal** » Thu Mar 07, 2013 3:16 pm

Why not use Convert() multiple times? Something similar to:

convert (convert('1234567890','',col1),str(<replacement char/s>,len(convert('1234567890','',col1))),col1)

Sorry, I can't validate the syntax at the moment. It's been a long time since I used it.

ray.wurlod · Post by **ray.wurlod** » Fri Mar 08, 2013 12:03 am

There's no function out of the box, but it would be very easy to create one.
Assuming that the string is already space-separated there is no real need to tokenise - if you think there is, please provide a more exact specification.

Code:

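FUNCTION ReplaceNonAlphaNumerics(aString,aReplaceChar)

AlphaNumerics = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
REM Add lower case alphabetic characters to string if required.

REM First pass: strip the alphanumerics, leaving only the characters to be replaced.
BadChars = Convert(AlphaNumerics, "", aString)

REM Second pass: map each of those characters to the replacement character.
Ans = Convert(BadChars, Str(aReplaceChar, Len(BadChars)), aString)

RETURN(Ans)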
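
For readers who find the nested Convert() idiom hard to follow, here is a rough Python illustration of the two passes. It is not DataStage BASIC; `convert` below is only a hypothetical stand-in that mimics how Convert() maps characters by position, and it assumes the replacement is a single character.

```python
def convert(from_chars, to_chars, text):
    """Hypothetical stand-in for BASIC Convert(): each character of from_chars
    is mapped to the character at the same position in to_chars, or deleted
    when to_chars is too short to supply one."""
    mapping = {}
    for i, ch in enumerate(from_chars):
        if ch not in mapping:  # the first occurrence of a character wins
            mapping[ch] = to_chars[i] if i < len(to_chars) else ""
    return "".join(mapping.get(ch, ch) for ch in text)


ALPHANUMERICS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
)


def replace_non_alphanumerics(text, replace_char):
    # Pass 1: delete every alphanumeric, so only the unwanted characters remain.
    bad_chars = convert(ALPHANUMERICS, "", text)
    # Pass 2: map each unwanted character to replace_char (assumed to be one character).
    return convert(bad_chars, replace_char * len(bad_chars), text)
```

The point of the first pass is that the unwanted characters never have to be listed in advance: whatever survives the deletion of the alphanumerics is, by definition, the set to replace.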

rameshrr3 · Post by **rameshrr3** » Fri Mar 08, 2013 11:55 am

Thanks for validating my fears. I'm going to continue using Convert().

Effectively, we are tokenizing a regular natural-language sentence into a set of delimited 'words' and pivoting them. We look the words up against one table to eliminate 'useless' words and text noise (prepositions, articles, etc.), then scan the significant words against a second keyword lookup table (which can keep growing) to do an English keyword search. Each sentence comes from a column called "reason desc", which stores free-form text. For each reason id, we assign a weight per keyword, sum the weights, and write back a total score that indicates how potentially meaningful the sentence is for feeding to a text analytics engine.
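
As a rough illustration of that flow (in Python rather than DataStage, purely to show the logic), the sketch below tokenizes each description, drops noise words, and sums keyword weights per reason id. The lookup contents, weights, and the name `score_reasons` are all made up for the example; in the real job they would be the two lookup tables.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for the two lookup tables.
STOP_WORDS = {"the", "a", "an", "of", "to", "for"}
KEYWORD_WEIGHTS = {"refund": 3, "damaged": 5, "late": 2}


def score_reasons(rows):
    """rows: iterable of (reason_id, reason_desc) pairs.
    Returns {reason_id: total keyword weight}."""
    totals = defaultdict(int)
    for reason_id, reason_desc in rows:
        totals[reason_id] += 0  # make sure every reason id gets a score
        # Replace each non-alphanumeric character with a space, then split into words.
        words = re.sub(r"[^A-Za-z0-9]", " ", reason_desc).lower().split()
        for word in words:
            if word in STOP_WORDS:
                continue  # noise word: prepositions, articles, etc.
            totals[reason_id] += KEYWORD_WEIGHTS.get(word, 0)
    return dict(totals)
```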

priyadarshikunal · Post by **priyadarshikunal** » Fri Mar 08, 2013 12:16 pm

If you really don't know in advance which characters will appear in what you call noise, you can use Convert() as suggested above.

ray.wurlod · Post by **ray.wurlod** » Fri Mar 08, 2013 7:21 pm

Once you've effected the conversion, change the space characters to Char(10), write to a text file with no formatting, and read back from the text file with line terminator specified.
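
The round trip through a text file can be pictured with the following Python sketch (again only an illustration of the idea, not DataStage code). The function name `words_to_rows` and the temporary file are assumptions for the example; empty lines caused by consecutive spaces are skipped when reading back.

```python
import os
import tempfile


def words_to_rows(cleaned):
    """Turn a space-separated string into one row per word by writing it
    to a text file with line feeds and reading the file back line by line."""
    text = cleaned.replace(" ", "\n")  # same idea as changing spaces to Char(10)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "words.txt")
        with open(path, "w", newline="\n") as f:
            f.write(text)  # written as-is, with no other formatting
        with open(path, "r", newline="\n") as f:
            # Each line is one word; skip the empty lines left by repeated spaces.
            return [line.rstrip("\n") for line in f if line.strip()]
```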