Can I split a token in 2 new tokens?

Infosphere's Quality Product

flavour
Participant
Posts: 11
Joined: Mon Dec 05, 2005 5:21 am

Post by flavour »

Hi,
I'm standardizing Italian addresses (but I'll use an English address as an example).
I've got a certain number of addresses like this:

SUNSET BOULEVARD12

Does anybody know a smart way to get:

SUNSET BOULEVARD 12

???
I already know a "smart but not optimal" way: I can create a special rule set where the SEPLIST is:

SEPLIST "1234567890"

and the only rules (supposing that the numbers take 5 digits max) are:

*^|^|^|^|^|$
COPY [1] TEMP
CONCAT [2] TEMP
CONCAT [3] TEMP
CONCAT [4] TEMP
CONCAT [5] TEMP
RETYPE [1] + TEMP TEMP
RETYPE [2] 0
RETYPE [3] 0
RETYPE [4] 0
RETYPE [5] 0

*^|^|^|^|$
.............

*^|^|^|$
............

*^|^|$
............

**
COPY_S [1] {UD}

Applying this rule set I'd reach my goal; the UD field would become the new input for my original rule set.

Many thanks
jhmckeever
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia

Post by jhmckeever »

Flavour,

If you run an Investigate on your example input, you'll see that the last token is classified as "<" (Leading Alphabetic). You can create a simple pattern to match the leading alphabetic token, and then use the "-n" (trailing numeric characters) and "c" (leading alphabetic characters) options with COPY to split it. E.g., for SUNSET BOULEVARD12 you'd use ...

Code:

? | <
COPY [1] {WF}      ;Whatever field
COPY [2](c) {ST}   ;StreetType Field
COPY [2](-n) {HN}  ;HouseNumber field
etc.
See the "Copying Leading and Trailing Characters" section in the QualityStage documentation.
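
For readers without the documentation to hand, the effect of the two COPY options on a single token can be pictured outside QualityStage. The Python below is only a rough sketch of the behaviour described above (leading alphabetic characters versus trailing numeric characters). It is not Pattern Action Language, and the function name `split_leading_alpha` is made up for illustration.

```python
import re

def split_leading_alpha(token):
    """Return (leading letters, trailing digits) for a token such as
    'BOULEVARD12', or None if the token is not letters followed only
    by digits. Hypothetical helper, not a QualityStage function."""
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", token)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_leading_alpha("BOULEVARD12"))  # ('BOULEVARD', '12')
```

The first element plays the role of the "(c)" copy and the second the role of the "(-n)" copy in the pattern above.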

HTH,
J.
John McKeever
Data Migrators (http://www.datamigrators.com/)
MettleCI - DevOps for DataStage (https://www.mettleci.com)

Post by flavour »

Hi jhmckeever,

you're right.
My example was too simple: I forgot to say that I really do need a new token to be created, because many later rules (written months ago, and which I don't want to modify) take care of other things I didn't mention.
But your solution is certainly better than my "smart but not optimal" one.

Thank you!

Post by jhmckeever »

Hi Flavour,

Hmmm - interesting one. One approach could be:

Code:

? | < 
COPY [2](c) temp1      ;Leading characters
COPY [2](-n) temp2     ;Trailing numerics
CONCAT " " temp1
CONCAT temp2 temp1     ;Create temp1 " " temp2
RETYPE [2] ? temp1
PATTERN {IP}

This is just off the top of my head so apologies in advance if this doesn't work!

J.

Post by flavour »

Really nice idea!! I'm going to try it.
See you soon!

Post by flavour »

Hi jhmckeever,

I'm sorry, but I think the PATTERN command can't create a token that never existed.
My input is:

VIALE FERRARI12 (in the UK I think it would be: 12, BOULEVARD FERRARI)

Your rule (adapted to my inputs) is:

T | <
COPY [2](c) temp1 ;Leading characters
COPY [2](-n) temp2 ;Trailing numerics
CONCAT " " temp1
CONCAT temp2 temp1 ;Create temp1 " " temp2
RETYPE [2] ? temp1
PATTERN {IP}

Immediately after your rule I have this "STOP" rule:

**
COPY_S [1] {UD}
PATTERN {UP}
EXIT

Well, this is what I get:

IP: T+ (It was T< before applying your rule)
UP: T+
UD: VIALE FERRARI 12

However, many thanks for your time!

Post by jhmckeever »

Hi flavour,

The simplest solution is to take what I described in my previous post and use it as a pre-processor. Pass your data through that pattern to split the '<' tokens where appropriate, then take the resulting file and pass it into your original pattern file, where it will be re-tokenised into multiple tokens before being processed.

Other alternatives are:

1. Check out the CONVERT_R (Convert with retokenization) operator. You may have to create a clunky solution to use this, but it would permit the creation of a new token on-the-fly.

2. Alternatively, if you're hosting your QualityStage job in a DataStage job you could just identify and split the offending token in a DataStage transformer before submitting it to QualityStage.
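
The splitting logic for that kind of upstream step could be sketched as follows. This is Python rather than a DataStage transformer expression, so treat it purely as an illustration of the rule "insert a space between a letter and the digits that end the word"; the name `split_glued_numbers` is hypothetical.

```python
import re

# A letter followed directly by digits that run to the end of the word.
# [^\W\d_] matches any letter, including accented ones.
_GLUED = re.compile(r"([^\W\d_])(\d+)\b")

def split_glued_numbers(address):
    """Insert a space before a house number glued to the preceding word.
    Hypothetical helper, shown only to illustrate the pre-processing idea."""
    return _GLUED.sub(r"\1 \2", address)

print(split_glued_numbers("VIALE FERRARI12"))  # VIALE FERRARI 12
```

Addresses that already contain the space pass through unchanged, so the step should be safe to apply to every record before it reaches the rule set.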

HTH,
J.

Post by flavour »

Hi jhmckeever,

I agree with your conclusions.

Thank you very much!
hans.tau.hatlestad@intelc
Participant
Posts: 1
Joined: Mon Feb 05, 2007 4:51 am

Re: Can I split a token in 2 new tokens?

Post by hans.tau.hatlestad@intelc »

Hi All!

I was really looking for a simpler solution to this, but here is my "acceptably less complex" solution, using convert_s for the token splitting:

+|>
copy [2] temp ;save for catch up below
copy "12" splitvalue ;help value for split
retype [2] S splitvalue ;precondition for split
+|S
convert_s [2] @splitvalues.tbl TKN B > ; table with one line "2 2"
+|>|B ; the new born token (B)
retype [2] > temp ;put original value back in
+|>|B
copy [2](c) chr_part ;get streetname suffix
copy [2](-n) num_part ;get house number
retype [2] + chr_part ; put in wanted value
retype [3] ^ num_part ; put in wanted value

RESULT:

+ + ^
SUNSET BOULEVARD 12
