REMOVE DUPLICATE WORDS
how do I remove words that are repeated more than once in a string
Example :
Input : JOHN DOE AND MARY DOE DBA JOHN INC
Output : JOHN DOE AND MARY DBA INC
Right now I am using DataStage to split out all the words and then get the unique values. I was wondering if there is a simpler way to do this in QualityStage. I was able to use the pattern language, but it seems quite cumbersome.
Any suggestions ??
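For comparison outside of DataStage/QualityStage, the word-level dedupe being asked for can be sketched in a few lines of Python (an illustrative equivalent, not a QualityStage solution): keep the first occurrence of each word and drop later repeats, preserving order.

```python
def dedupe_words(s):
    """Keep only the first occurrence of each word, preserving order."""
    seen = set()
    out = []
    for word in s.split():
        if word not in seen:
            seen.add(word)
            out.append(word)
    return " ".join(out)

print(dedupe_words("JOHN DOE AND MARY DOE DBA JOHN INC"))
# JOHN DOE AND MARY DBA INC
```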
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can use PAL but, as you note, it will be quite cumbersome. You'll need to have the first PrimaryName already copied, then do a conditional pattern to handle the second PrimaryName. The condition, of course, will be an equality test; if it is satisfied, reclassify the second PrimaryName to the Null class (0). You will also have to handle the inequality case - probably move the entire second name to Additional Name Info. Other solutions almost certainly exist; that's how I'd approach it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
I guess you could use the reverse floating positioning specifier or a fixed position specifier.
Classify the token that you want to remove, let's say, to a class R.
The pattern below will make the rightmost R token null:
*R | $
retype [1] 0
You can use the pattern below with a routine and a REPEAT clause to null all R tokens after the first one:
%2R
retype [1] 0
REPEAT
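The effect of that REPEAT-driven pattern - null out every R-classified token after the first - can be sketched in Python (an illustrative equivalent; the token/class pairing is assumed to come from an earlier classification step):

```python
def strip_repeated_class(tokens, classes, target="R"):
    """Drop every token of class `target` after the first one,
    mirroring the REPEAT + RETYPE-to-null pattern above."""
    seen_first = False
    kept = []
    for tok, cls in zip(tokens, classes):
        if cls == target:
            if seen_first:
                continue  # equivalent of RETYPE [1] 0 on later R tokens
            seen_first = True
        kept.append(tok)
    return kept

tokens = "JOHN DOE AND MARY DOE DBA JOHN INC".split()
classes = ["?", "R", "&", "?", "R", "K", "?", "I"]  # assumed classification
print(" ".join(strip_repeated_class(tokens, classes)))
# JOHN DOE AND MARY DBA JOHN INC
```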
Last edited by JRodriguez on Tue Dec 22, 2009 3:59 pm, edited 1 time in total.
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
How does it currently classify it?
Input : JOHN DOE AND MARY DOE DBA JOHN INC
Output : JOHN DOE AND MARY DBA INC
This example also looks like 2 or 3 separate pieces of information that make sense when parsed correctly and not just chopped up.
John Doe and Mary Doe - people's names, obviously
The rest could be something like:
DBA John Inc - Company name
or
DBA - a position description
John Inc - company name
If that example is indicative of the data you have in there, I'd be asking more questions about how they're using the field to work out what you should do.
Maybe you need a prep ruleset first to split it up (If the DBA is a position description, then that would be a finite number of values that could be used to crack the whole thing wide open).
If there are 2 or 3 pieces of separate information there, then they need to be split and parsed individually.
Otherwise, anything you might do to chop this or that out will only corrupt your data.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
DBA = Doing Business As
There are a lot of other keywords like DBA embedded in the data.
What I am trying to do is remove all the first names, middle names, and known keywords (like DBA/POD), keep a list of unknown words, eliminate the duplicate words, and then try to create a LOOSE match for exception handling (sorry if this is confusing).
Right now I am writing a pattern file to get to the unique words.
Follow up question
Is it possible to search for a variable in a string ?
;INPUT -- JOHN DOE DBA JOE PIZZA DBA PIZZA TO GO
&
COPY "DBA" vKeyWord01
*&=vKeyWord01|** ; WHAT IS THE RIGHT SYNTAX ?
*&="DBA" | ** ; THIS WORKS BUT NOT WHAT I NEED
Thanks in advance.
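For reference, searching a string for a keyword held in a variable (rather than a hard-coded literal like "DBA") is straightforward in Python with a whole-word regex - an illustrative equivalent of the variable-comparison pattern being asked about, not PAL syntax:

```python
import re

def find_keyword(line, keyword):
    """Return the start positions of whole-word matches of `keyword`,
    where `keyword` is a runtime variable, not a literal in the pattern."""
    return [m.start() for m in re.finditer(rf"\b{re.escape(keyword)}\b", line)]

line = "JOHN DOE DBA JOE PIZZA DBA PIZZA TO GO"
print(find_keyword(line, "DBA"))
# [9, 23]
```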
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
These kinds of words (like "TA" for "trading as") can be classified into a suitable class. That will make the parsing and pattern-action easier to implement. You may need more patterns to handle things like "T/A".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
So to get this clear in my head: You want a unique list of unknown words and then (and this bit I understand less than the rest of it) use them to make some sort of key for matching?
Well, here goes nothing...
The unknown words bit is easy. Put everything you want removed into a classification file and then for every defined type, do the following:
;Here for F. Repeat for every type you defined.
0*F
RETYPE [1] 0
Deduping the unknown words within the PAT file will be a pain.
Off the top of my head (insert disclaimer here), something like this should work:
; Take note of the token you're trying to dedupe.
& | &
COPY [1] temp
RETYPE [1] 0
; Look for a second instance of it
; May also need to do this one a couple of times if the same unknown word shows up more than twice.
** | & [{} = temp] | [weirdDedupeKeyThingy = ""]
CONCAT [1] weirdDedupeKeyThingy
RETYPE [2] 0
** | & [{} = temp] | [weirdDedupeKeyThingy != ""]
CONCAT " " weirdDedupeKeyThingy
CONCAT [1] weirdDedupeKeyThingy
RETYPE [2] 0
Repeat the 2nd one a couple of times to form a block, then repeat the block, until you get what you want.
In the end you may have one or more left:
& | $ [weirdDedupeKeyThingy = ""]
CONCAT [1] weirdDedupeKeyThingy
COPY weirdDedupeKeyThingy {WeirdThingyOutputField}
RETYPE [1] 0
& | $ [weirdDedupeKeyThingy != ""]
CONCAT " " weirdDedupeKeyThingy
CONCAT [1] weirdDedupeKeyThingy
COPY weirdDedupeKeyThingy {WeirdThingyOutputField}
RETYPE [1] 0
Still don't get the point though...
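The PAT-file dedupe sketch above boils down to: keep the first occurrence of each unknown word in the name, and CONCAT later repeats into a separate space-delimited field (the weirdDedupeKeyThingy variable) while retyping them to null. A Python sketch of that same logic, for illustration only:

```python
def split_unknown_dupes(words):
    """Keep the first occurrence of each word in the name; collect later
    repeats into a separate space-delimited key field, mirroring the
    CONCAT-into-weirdDedupeKeyThingy / RETYPE-to-null pattern above."""
    seen = set()
    name, key = [], []
    for w in words:
        if w in seen:
            key.append(w)   # CONCAT [1] weirdDedupeKeyThingy; RETYPE [2] 0
        else:
            seen.add(w)
            name.append(w)
    return " ".join(name), " ".join(key)

print(split_unknown_dupes("JOHN DOE AND MARY DOE DBA JOHN INC".split()))
# ('JOHN DOE AND MARY DBA INC', 'DOE JOHN')
```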
You will need to do a lot of data analysis before defining what kind of parsing rules are needed, because just fixing your current name pattern might not cover all the data cleansing required before standardization or matching.
Step one - identify known and unknown data patterns with example data.
Step two- show user and take their recommendations and define high level rules.
Step three- build QS jobs to properly place data into respective buckets.
like Prefix, First Name, Middle Name, Last Name, Suffix, Additional Name.
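Step three's "buckets" can be illustrated with a deliberately naive Python sketch. The prefix/suffix word lists and the positional rules here are assumptions for demonstration only; real QualityStage rulesets are far richer:

```python
# Hypothetical word lists - a real ruleset's classification file is much larger.
PREFIXES = {"MR", "MRS", "MS", "DR"}
SUFFIXES = {"JR", "SR", "II", "III"}

def bucket_name(tokens):
    """Naively place name tokens into standardization buckets."""
    buckets = {"Prefix": "", "FirstName": "", "MiddleName": "",
               "LastName": "", "Suffix": "", "AdditionalName": ""}
    toks = list(tokens)
    if toks and toks[0] in PREFIXES:
        buckets["Prefix"] = toks.pop(0)
    if toks and toks[-1] in SUFFIXES:
        buckets["Suffix"] = toks.pop()
    if toks:
        buckets["FirstName"] = toks.pop(0)
    if toks:
        buckets["LastName"] = toks.pop()
    buckets["MiddleName"] = " ".join(toks)
    return buckets

print(bucket_name("DR JOHN Q DOE JR".split()))
```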