REMOVE DUPLICATE WORDS
how do I remove words that are repeated more than once in a string
Example :
Input : JOHN DOE AND MARY DOE DBA JOHN INC
Output : JOHN DOE AND MARY DBA INC
Right now I am using DataStage to split out all the words and then get the unique values. I was wondering if there is a simpler way to do this in QualityStage. I was able to use the pattern language, but it seems quite cumbersome.
Any suggestions ??
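For comparison outside of DataStage/QualityStage, the word-level dedupe being asked for can be sketched in a few lines of Python (an illustrative equivalent, not a QualityStage solution): keep the first occurrence of each word and drop later repeats, preserving order.

```python
def dedupe_words(s):
    """Keep only the first occurrence of each word, preserving order."""
    seen = set()
    out = []
    for word in s.split():
        if word not in seen:
            seen.add(word)
            out.append(word)
    return " ".join(out)

print(dedupe_words("JOHN DOE AND MARY DOE DBA JOHN INC"))
# JOHN DOE AND MARY DBA INC
```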
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can use PAL but, as you note, it will be quite cumbersome. You'll need to have the first PrimaryName already copied, then do a conditional pattern to handle the second PrimaryName. The condition, of course, will be an equality test; if it is satisfied, reclassify the second PrimaryName to the Null class (0). You will also have to handle the inequality case - probably move the entire second name to Additional Name Info. Other solutions almost certainly exist; that's how I'd approach it.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 425
- Joined: Sat Nov 19, 2005 9:26 am
- Location: New York City
- Contact:
I guess you could use the reverse floating positioning specifier or a fixed position specifier.
Classify the token that you want to remove, let's say, to a class R.
The pattern below will make the rightmost R token null:
*R | $
retype [1] 0
You can use the pattern below with a routine and a REPEAT clause to null all R tokens after the first one:
%2R
retype [1] 0
REPEAT
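The effect of that REPEAT-driven pattern - null out every R-classified token after the first - can be sketched in Python (an illustrative equivalent; the token/class pairing is assumed to come from an earlier classification step):

```python
def strip_repeated_class(tokens, classes, target="R"):
    """Drop every token of class `target` after the first one,
    mirroring the REPEAT + RETYPE-to-null pattern above."""
    seen_first = False
    kept = []
    for tok, cls in zip(tokens, classes):
        if cls == target:
            if seen_first:
                continue  # equivalent of RETYPE [1] 0 on later R tokens
            seen_first = True
        kept.append(tok)
    return kept

tokens = "JOHN DOE AND MARY DOE DBA JOHN INC".split()
classes = ["?", "R", "&", "?", "R", "K", "?", "I"]  # assumed classification
print(" ".join(strip_repeated_class(tokens, classes)))
# JOHN DOE AND MARY DBA JOHN INC
```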
Last edited by JRodriguez on Tue Dec 22, 2009 3:59 pm, edited 1 time in total.
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
How does it currently classify it?
Input : JOHN DOE AND MARY DOE DBA JOHN INC
Output : JOHN DOE AND MARY DBA INC
This example also looks like 2 or 3 separate pieces of information that make sense when parsed correctly and not just chopped up.
John Doe and Mary Doe - people's names, obviously
The rest could be something like:
DBA John Inc - Company name
or
DBA - a position description
John Inc - company name
If that example is indicative of the data you have in there, I'd be asking more questions about how they're using the field to work out what you should do.
Maybe you need a prep ruleset first to split it up (If the DBA is a position description, then that would be a finite number of values that could be used to crack the whole thing wide open).
If there are 2 or 3 pieces of separate information there, then they need to be split and parsed individually.
Otherwise, anything you might do to chop this or that out will only corrupt your data.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
DBA = Doing Business As
There are a lot of other keywords like DBA embedded in the data.
What I am trying to do is remove all the first names, middle names, and known keywords (like DBA/POD), keep a list of unknown words, eliminate the duplicate words, and then try to create a LOOSE match for exception handling (sorry if this is confusing).
Right now I am writing a pattern file to get to the unique words.
Follow up question
Is it possible to search for a variable in a string ?
;INPUT -- JOHN DOE DBA JOE PIZZA DBA PIZZA TO GO
&
COPY "DBA" vKeyWord01
*&=vKeyWord01|** ; WHAT IS THE RIGHT SYNTAX ?
*&="DBA" | ** ; THIS WORKS BUT NOT WHAT I NEED
Thanks in advance.
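For reference, searching a string for a keyword held in a variable (rather than a hard-coded literal like "DBA") is straightforward in Python with a whole-word regex - an illustrative equivalent of the variable-comparison pattern being asked about, not PAL syntax:

```python
import re

def find_keyword(line, keyword):
    """Return the start positions of whole-word matches of `keyword`,
    where `keyword` is a runtime variable, not a literal in the pattern."""
    return [m.start() for m in re.finditer(rf"\b{re.escape(keyword)}\b", line)]

line = "JOHN DOE DBA JOE PIZZA DBA PIZZA TO GO"
print(find_keyword(line, "DBA"))
# [9, 23]
```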
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
These kinds of words (like "TA" for "trading as") can be classified into a suitable class. That will make the parsing and pattern-action easier to implement. You may need more patterns to handle things like "T/A".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
So to get this clear in my head: You want a unique list of unknown words and then (and this bit I understand less than the rest of it) use them to make some sort of key for matching?
Well, here goes nothing...
The unknown words bit is easy. Put everything you want removed into a classification file and then for every defined type, do the following:
;Here for F. Repeat for every type you defined.
0*F
RETYPE [1] 0
Deduping the unknown words within the PAT file will be a pain.
Off the top of my head (insert disclaimer here), something like this should work:
; Take note of the token you're trying to dedupe.
& | &
COPY [1] temp
RETYPE [1] 0
; Look for a second instance of it
; May also need to do this one a couple of times if the same unknown word shows up more than twice.
** | & [{} = temp] | [weirdDedupeKeyThingy = ""]
CONCAT [1] weirdDedupeKeyThingy
RETYPE [2] 0
** | & [{} = temp] | [weirdDedupeKeyThingy != ""]
CONCAT " " weirdDedupeKeyThingy
CONCAT [1] weirdDedupeKeyThingy
RETYPE [2] 0
Repeat the 2nd one a couple of times to form a block, then repeat the block, until you get what you want.
In the end you may have one or more left:
& | $ [weirdDedupeKeyThingy = ""]
CONCAT [1] weirdDedupeKeyThingy
COPY weirdDedupeKeyThingy {WeirdThingyOutputField}
RETYPE [1] 0
& | $ [weirdDedupeKeyThingy != ""]
CONCAT " " weirdDedupeKeyThingy
CONCAT [1] weirdDedupeKeyThingy
COPY weirdDedupeKeyThingy {WeirdThingyOutputField}
RETYPE [1] 0
Still don't get the point though...
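The PAT-file dedupe sketch above boils down to: keep the first occurrence of each unknown word in the name, and CONCAT later repeats into a separate space-delimited field (the weirdDedupeKeyThingy variable) while retyping them to null. A Python sketch of that same logic, for illustration only:

```python
def split_unknown_dupes(words):
    """Keep the first occurrence of each word in the name; collect later
    repeats into a separate space-delimited key field, mirroring the
    CONCAT-into-weirdDedupeKeyThingy / RETYPE-to-null pattern above."""
    seen = set()
    name, key = [], []
    for w in words:
        if w in seen:
            key.append(w)   # CONCAT [1] weirdDedupeKeyThingy; RETYPE [2] 0
        else:
            seen.add(w)
            name.append(w)
    return " ".join(name), " ".join(key)

print(split_unknown_dupes("JOHN DOE AND MARY DOE DBA JOHN INC".split()))
# ('JOHN DOE AND MARY DBA INC', 'DOE JOHN')
```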
You will need to do a lot of data analysis before defining what kind of parsing rules are needed, because just fixing your current name pattern might not cover all the data cleansing required before standardization or matching.
Step one - identify known and unknown data patterns with example data.
Step two- show user and take their recommendations and define high level rules.
Step three- build QS jobs to properly place data into respective buckets.
like Prefix, First Name, Middle Name, Last Name, Suffix, Additional Name.
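Step three's "buckets" can be illustrated with a deliberately naive Python sketch. The prefix/suffix word lists and the positional rules here are assumptions for demonstration only; real QualityStage rulesets are far richer:

```python
# Hypothetical word lists - a real ruleset's classification file is much larger.
PREFIXES = {"MR", "MRS", "MS", "DR"}
SUFFIXES = {"JR", "SR", "II", "III"}

def bucket_name(tokens):
    """Naively place name tokens into standardization buckets."""
    buckets = {"Prefix": "", "FirstName": "", "MiddleName": "",
               "LastName": "", "Suffix": "", "AdditionalName": ""}
    toks = list(tokens)
    if toks and toks[0] in PREFIXES:
        buckets["Prefix"] = toks.pop(0)
    if toks and toks[-1] in SUFFIXES:
        buckets["Suffix"] = toks.pop()
    if toks:
        buckets["FirstName"] = toks.pop(0)
    if toks:
        buckets["LastName"] = toks.pop()
    buckets["MiddleName"] = " ".join(toks)
    return buckets

print(bucket_name("DR JOHN Q DOE JR".split()))
```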