Can I split a token in 2 new tokens?
Posted: Mon Nov 06, 2006 10:44 am
by flavour
Hi,
I'm standardizing Italian addresses (but I'll try to give an English address example).
I've got a certain number of addresses like this:
SUNSET BOULEVARD12
Does anybody know a smart way to get:
SUNSET BOULEVARD 12
???
I already know a "smart but not optimal" way: I can create a special rule set where the SEPLIST is:
SEPLIST "1234567890"
and the only rules (supposing that the numbers take 5 digits max) are:
*^|^|^|^|^|$
COPY [1] TEMP
CONCAT [2] TEMP
CONCAT [3] TEMP
CONCAT [4] TEMP
CONCAT [5] TEMP
RETYPE [1] + TEMP TEMP
RETYPE [2] 0
RETYPE [3] 0
RETYPE [4] 0
RETYPE [5] 0
*^|^|^|^|$
.............
*^|^|^|$
............
*^|^|$
............
**
COPY_S [1] {UD}
Applying this rule set I'd reach my goal; the UD field would become the new input for my original rule set.
Many thanks
Posted: Mon Nov 06, 2006 12:40 pm
by jhmckeever
Flavour,
If you run an Investigate on your example input, you'll see the last token is classified as "<" (Leading Alphabetic). You can create a simple pattern to match your leading alphabetic token and then use the "-n" (trailing numeric characters) and "c" (leading alphabetic characters) options with COPY to split your token. For example, for SUNSET BOULEVARD12 you'd use:
? | <
COPY [1] {WF} ;Whatever field
COPY [2](c) {ST} ;StreetType Field
COPY [2](-n) {HN} ;HouseNumber field
etc.
See the "Copying Leading and Trailing Characters" section in the QualityStage documentation.
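As a rough illustration outside QualityStage, the two COPY options described above can be mimicked in Python. This is only a sketch: the helper names `leading_alpha` and `trailing_numeric` are hypothetical, and the behaviour assumed is simply what the post states ("c" gives the leading alphabetic characters, "-n" gives the trailing numeric characters), restricted to ASCII letters and digits.

```python
import re


def leading_alpha(token: str) -> str:
    """Return the run of ASCII letters at the start of the token (may be empty)."""
    # Assumption: mirrors the "(c)" option as described in the post.
    return re.match(r"[A-Za-z]*", token).group(0)


def trailing_numeric(token: str) -> str:
    """Return the run of ASCII digits at the end of the token (may be empty)."""
    # Assumption: mirrors the "(-n)" option as described in the post.
    return re.search(r"[0-9]*$", token).group(0)


if __name__ == "__main__":
    token = "BOULEVARD12"
    print(leading_alpha(token), trailing_numeric(token))
```

Applied to the second token of SUNSET BOULEVARD12, the two helpers yield the street type and the house number that the rule copies into {ST} and {HN}.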
HTH,
J.
Posted: Tue Nov 07, 2006 2:34 am
by flavour
Hi jhmckeever,
you're right.
My example was too simple: I forgot to say that I really need a new token to be created, because many later rules (written months ago, and which I don't want to modify) handle other things I didn't mention.
But your solution is much better than my "smart but not optimal" one.
Thank you!
Posted: Tue Nov 07, 2006 5:35 am
by jhmckeever
Hi Flavour,
Hmmm - interesting one. One approach could be:
? | <
COPY [2](c) temp1 ;Leading characters
COPY [2](-n) temp2 ;Trailing numerics
CONCAT " " temp1
CONCAT temp2 temp1 ;Create temp1 " " temp2
RETYPE [2] ? temp1
PATTERN {IP}
This is just off the top of my head, so apologies in advance if it doesn't work!
J.
Posted: Tue Nov 07, 2006 5:52 am
by flavour
Really nice idea!! I'm going to try it.
See you soon!
Posted: Tue Nov 07, 2006 6:15 am
by flavour
Hi jhmckeever,
I'm sorry, but I think the PATTERN command can't create a token that did not already exist.
My input is:
VIALE FERRARI12 (in the UK I think it would be: 12, BOULEVARD FERRARI)
Your rule (adapted to my inputs) is:
T | <
COPY [2](c) temp1 ;Leading characters
COPY [2](-n) temp2 ;Trailing numerics
CONCAT " " temp1
CONCAT temp2 temp1 ;Create temp1 " " temp2
RETYPE [2] ? temp1
PATTERN {IP}
Immediately after your rule I have this "STOP" rule:
**
COPY_S [1] {UD}
PATTERN {UP}
EXIT
Well, this is what I get:
IP: T+ (It was T< before applying your rule)
UP: T+
UD: VIALE FERRARI 12
However, many thanks for your time!
Posted: Tue Nov 07, 2006 10:15 am
by jhmckeever
Hi flavour,
The simplest solution is to take what I described in my previous post and use it as a pre-processor. Just pass your data through that pattern to split the '<' tokens where appropriate, then take the resulting file and pass it into your original pattern file, where it will be re-tokenised into multiple tokens before being processed.
Other alternatives are:
1. Check out the CONVERT_R (Convert with retokenization) operator. You may have to create a clunky solution to use this, but it would permit the creation of a new token on-the-fly.
2. Alternatively, if you're hosting your QualityStage job in a DataStage job you could just identify and split the offending token in a DataStage transformer before submitting it to QualityStage.
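The pre-processing idea (splitting the glued token before the data reaches the original rule set) can be sketched in Python. This is not DataStage transformer code: the function name `split_glued_tokens` is hypothetical, and the sketch assumes tokens are whitespace-separated and that only words made of ASCII letters followed directly by digits should be split.

```python
import re

# A word made only of letters immediately followed by only digits, e.g. "FERRARI12".
# Assumption: ASCII letters only; accented characters would need a wider class.
_GLUED = re.compile(r"^([A-Za-z]+)([0-9]+)$")


def split_glued_tokens(line: str) -> str:
    """Insert a space between a leading alphabetic part and a trailing number."""
    out = []
    for word in line.split():
        match = _GLUED.match(word)
        if match:
            # Emit the alphabetic part and the numeric part as two separate words.
            out.extend(match.groups())
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    for address in ("SUNSET BOULEVARD12", "VIALE FERRARI12", "MAIN STREET 5"):
        print(split_glued_tokens(address))
```

Because the split happens before tokenisation, the existing rules would then see the house number as its own token without being modified.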
HTH,
J.
Posted: Tue Nov 07, 2006 10:24 am
by flavour
Hi jhmckeever,
I agree with your conclusions.
Thank you very much!
Re: Can I split a token in 2 new tokens?
Posted: Wed Jul 02, 2008 4:11 pm
by hans.tau.hatlestad@intelc
Hi All!
I was really looking for a simpler solution to this, but here is my "acceptably less complex" solution using convert_s for the token splitting:
+|>
copy [2] temp ;save for catch up below
copy "12" splitvalue ;help value for split
retype [2] S splitvalue ;precondition for split
+|S
convert_s [2] @splitvalues.tbl TKN B > ; table with one line "2 2"
+|>|B ; the new born token (B)
retype [2] > temp ;put original value back in
+|>|B
copy [2](c) chr_part ;get streetname suffix
copy [2](-n) num_part ;get house number
retype [2] + chr_part ; put in wanted value
retype [3] ^ num_part ; put in wanted value
RESULT:
+ + ^
SUNSET BOULEVARD 12