Can I split a token in 2 new tokens?

flavour · Post by **flavour** » Mon Nov 06, 2006 10:44 am

Hi,
I'm standardizing Italian addresses (but I'll use an English address as the example).
I've got a certain number of addresses like this:

SUNSET BOULEVARD12

Does anybody know a smart way to get:

SUNSET BOULEVARD 12

???
I already know a "smart but not optimal" way: I can create a special rule set where the SEPLIST is:

SEPLIST "1234567890"

and the only rules (supposing that the numbers take 5 digits max) are:

*^|^|^|^|^|$
COPY [1] TEMP
CONCAT [2] TEMP
CONCAT [3] TEMP
CONCAT [4] TEMP
CONCAT [5] TEMP
RETYPE [1] + TEMP TEMP
RETYPE [2] 0
RETYPE [3] 0
RETYPE [4] 0
RETYPE [5] 0

*^|^|^|^|$
.............

*^|^|^|$
............

*^|^|$
............

**
COPY_S [1] {UD}

Applying this rule set I'd reach my goal; the UD field would become the new input for my original rule set.

Many thanks

jhmckeever · Post by **jhmckeever** » Mon Nov 06, 2006 12:40 pm

Flavour,

If you run an Investigate on your example input you'll see that BOULEVARD12 is classified as "<" (Leading Alphabetic). You can create a simple pattern to match the leading alphabetic token and then use the "-n" (trailing numeric characters) and "c" (leading alphabetic characters) options with COPY to split it. E.g., for SUNSET BOULEVARD12 you'd use ...


? | <
COPY [1] {WF}      ;Whatever field
COPY [2](c) {ST}   ;StreetType Field
COPY [2](-n) {HN}  ;HouseNumber field
etc.

See the "Copying Leading and Trailing Characters" section in the QualityStage documentation.
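For readers without a QualityStage install to hand, the effect of the two COPY options described above can be imitated outside the tool. This is only a Python sketch of the behaviour as described in this thread (leading alphabetic characters for "c", trailing numeric characters for "-n"), not QualityStage code; the function names `leading_alpha` and `trailing_numeric` are made up for illustration.

```python
def leading_alpha(token: str) -> str:
    """Return the run of alphabetic characters at the start of token.

    Hypothetical stand-in for the COPY "(c)" option as described above.
    """
    i = 0
    while i < len(token) and token[i].isalpha():
        i += 1
    return token[:i]


def trailing_numeric(token: str) -> str:
    """Return the run of ASCII digits at the end of token.

    Hypothetical stand-in for the COPY "(-n)" option as described above.
    """
    i = len(token)
    while i > 0 and "0" <= token[i - 1] <= "9":
        i -= 1
    return token[i:]


token = "BOULEVARD12"
street_type = leading_alpha(token)      # alphabetic part of the token
house_number = trailing_numeric(token)  # numeric part of the token
print(street_type, house_number)
```

Note that this only fills two separate values from one token, which is what the rule above does with {ST} and {HN}; it does not add a token to the pattern.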

HTH,
J.

flavour · Post by **flavour** » Tue Nov 07, 2006 2:34 am

Hi jhmckeever,

you're right.
My example was too simple: I forgot to say that I really do need a new token to be created, because a lot of later rules (written months ago, and I'd rather not modify them) take care of other things I didn't mention.
Still, your solution is much better than my "smart but not optimal" one.

Thank you!

jhmckeever · Post by **jhmckeever** » Tue Nov 07, 2006 5:35 am

Hi Flavour,

Hmmm - interesting one. One approach could be:


? | < 
COPY [2](c) temp1      ;Leading characters
COPY [2](-n) temp2     ;Trailing numerics
CONCAT " " temp1
CONCAT temp2 temp1     ;Create temp1 " " temp2
RETYPE [2] ? temp1
PATTERN {IP}

This is just off the top of my head so apologies in advance if this doesn't work!

J.

flavour · Post by **flavour** » Tue Nov 07, 2006 5:52 am

Really nice idea!! I'm going to try it.
See you soon!

flavour · Post by **flavour** » Tue Nov 07, 2006 6:15 am

Hi jhmckeever,

I'm sorry, but I think the PATTERN command can't create a token that didn't already exist.
My input is:

VIALE FERRARI12 (in the UK I think it would be: 12, BOULEVARD FERRARI)

Your rule (adapted to my inputs) is:

T | <
COPY [2](c) temp1 ;Leading characters
COPY [2](-n) temp2 ;Trailing numerics
CONCAT " " temp1
CONCAT temp2 temp1 ;Create temp1 " " temp2
RETYPE [2] ? temp1
PATTERN {IP}

Immediately after your rule I have this "STOP" rule:

**
COPY_S [1] {UD}
PATTERN {UP}
EXIT

Well, this is what I get:

IP: T+ (It was T< before applying your rule)
UP: T+
UD: VIALE FERRARI 12

However, many thanks for your time!

jhmckeever · Post by **jhmckeever** » Tue Nov 07, 2006 10:15 am

Hi flavour,

The simplest solution is to take what I described in my previous post and use it as a pre-processor. Pass your data through that pattern to split the '<' tokens where appropriate, then feed the resulting file into your original pattern file, where it will be re-tokenised into multiple tokens before being processed.

Other alternatives are ...

1. Check out the CONVERT_R (Convert with retokenization) operator. You may have to create a clunky solution to use this, but it would permit the creation of a new token on-the-fly.

2. Alternatively, if you're hosting your QualityStage job in a DataStage job you could just identify and split the offending token in a DataStage transformer before submitting it to QualityStage.
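Whichever tool does the pre-splitting, the idea is the same: insert a space between the letters and the digits before the record reaches the original rule set, so the tokenizer sees two tokens. Below is a rough Python sketch of that step, standing in for the transformer or pre-processing pass; it is not DataStage or QualityStage code, and the name `split_glued_tokens` and the choice to touch only words made of letters followed by digits are assumptions.

```python
import re

# A whole word made of one or more letters followed by one or more digits,
# e.g. "BOULEVARD12". [^\W\d_] matches any Unicode letter, so accented
# Italian letters are covered as well.
_GLUED = re.compile(r"\b([^\W\d_]+)(\d+)\b")


def split_glued_tokens(line: str) -> str:
    """Insert a space between the alphabetic and numeric parts of such words."""
    return _GLUED.sub(r"\1 \2", line)


for address in ("SUNSET BOULEVARD12", "VIALE FERRARI12", "VIALE FERRARI 12"):
    print(split_glued_tokens(address))
```

Words that mix letters and digits in any other order (for example A1B2) are left alone, so addresses that are already well formed pass through unchanged.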

HTH,
J.

flavour · Post by **flavour** » Tue Nov 07, 2006 10:24 am

Hi jhmckeever,

I agree with your conclusions.

Thank you very much!

hans.tau.hatlestad@intelc · Wed Jul 02, 2008 4:11 pm

Hi All!

I was really looking for a simpler solution to this, but here is my "acceptably less complex" solution using convert_s for the token splitting:

+|>
copy [2] temp ;save for catch up below
copy "12" splitvalue ;help value for split
retype [2] S splitvalue ;precondition for split
+|S
convert_s [2] @splitvalues.tbl TKN B > ; table with one line "2 2"
+|>|B ; the new born token (B)
retype [2] > temp ;put original value back in
+|>|B
copy [2](c) chr_part ;get streetname suffix
copy [2](-n) num_part ;get house number
retype [2] + chr_part ; put in wanted value
retype [3] ^ num_part ; put in wanted value

RESULT:

+ + ^
SUNSET BOULEVARD 12


DSXchange
