Duplicate Records question

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
rodre
Premium Member
Premium Member
Posts: 218
Joined: Wed Mar 01, 2006 1:28 pm
Location: Tennessee

Duplicate Records question

Post by rodre »

Dealing with duplicate records....

I have a question that perhaps someone can help me:

I was able to extract the duplicates from a file bases on name and street. Now I need to extract some information that might be available in some of the duplicate records.
Here is an example:

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
1       John          Doe       1 Main    e@yahoo.com                   
2       John          Doe       1 Main                      345-2546 
3       John          Doe       1 Main                                            
The end result should be:

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
2(Any)  John          Doe       1 Main    e@yahoo.com         345-2546


Basically one record with any available info in email or phone.

Does anyone have any suggestions on how to do this... :?:

Thank you in advance for your help!! :)
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Pass your data through aggregator, group by FirstName, LastName and Street. Provide anything for ID, and provide MAX() for all the rest of the columns. Anything will be greater than a space or empty byte or even a null.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
47shailesh
Participant
Posts: 60
Joined: Tue Aug 29, 2006 11:14 pm

Post by 47shailesh »

If you are using select query than use NOT NULL constraint on each column that you are selecting
gateleys
Premium Member
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

That's what I would do ... like DSGuru said.

Hey rodre, just wanted to know how you performed cleansing of the names and street information. For example, if your rows came in as -

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
1       Jon          Doe           1 Main    e@yahoo.com                    
2       John          Do           1 Main                      345-2546 
3       John          Doe         1 Main                             
Soundex will be too inclusive. QualityStage, definitely is the right tool. Just wanted to know how you did it.

gateleys
gateleys
Premium Member
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

47shailesh wrote:If you are using select query than use NOT NULL constraint on each column that you are selecting
What????? :shock:

Can you be more elaborate?

gateleys
rodre
Premium Member
Premium Member
Posts: 218
Joined: Wed Mar 01, 2006 1:28 pm
Location: Tennessee

Post by rodre »

DSguru2B wrote:Pass your data through aggregator, group by FirstName, LastName and Street. Provide anything for ID, and provide MAX() for all the rest of the columns. Anything will be greater than a space or empty byte or even a null.
That did it...
Thank You very much for your help!!! :D :D
rodre
Premium Member
Premium Member
Posts: 218
Joined: Wed Mar 01, 2006 1:28 pm
Location: Tennessee

Post by rodre »

gateleys wrote:That's what I would do ... like DSGuru said.

Soundex will be too inclusive. QualityStage, definitely is the right tool. Just wanted to know how you did it.

gateleys
I used this thread to get the duplicates out...

viewtopic.php?t=97213&highlight=ChkDup

Hopes this Helps!! :)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Definitely use QualityStage. This is precisely the kind of thing it does, with the added benefit of configurable levels of uncertainty in the matching phase. You can use NYSIIS and/or Soundex (forward or reverse) or not, as pleases you.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply