Page 1 of 1

Duplicate Records question

Posted: Thu Jun 14, 2007 12:49 pm
by rodre
Dealing with duplicate records....

I have a question that perhaps someone can help me:

I was able to extract the duplicates from a file bases on name and street. Now I need to extract some information that might be available in some of the duplicate records.
Here is an example:

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
1       John          Doe       1 Main    e@yahoo.com                   
2       John          Doe       1 Main                      345-2546 
3       John          Doe       1 Main                                            
The end result should be:

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
2(Any)  John          Doe       1 Main    e@yahoo.com         345-2546


Basically one record with any available info in email or phone.

Does anyone have any suggestions on how to do this... :?:

Thank you in advance for your help!! :)

Posted: Thu Jun 14, 2007 1:28 pm
by DSguru2B
Pass your data through aggregator, group by FirstName, LastName and Street. Provide anything for ID, and provide MAX() for all the rest of the columns. Anything will be greater than a space or empty byte or even a null.

Posted: Thu Jun 14, 2007 1:34 pm
by 47shailesh
If you are using select query than use NOT NULL constraint on each column that you are selecting

Posted: Thu Jun 14, 2007 1:35 pm
by gateleys
That's what I would do ... like DSGuru said.

Hey rodre, just wanted to know how you performed cleansing of the names and street information. For example, if your rows came in as -

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
1       Jon          Doe           1 Main    e@yahoo.com                    
2       John          Do           1 Main                      345-2546 
3       John          Doe         1 Main                             
Soundex will be too inclusive. QualityStage, definitely is the right tool. Just wanted to know how you did it.

gateleys

Posted: Thu Jun 14, 2007 1:37 pm
by gateleys
47shailesh wrote:If you are using select query than use NOT NULL constraint on each column that you are selecting
What????? :shock:

Can you be more elaborate?

gateleys

Posted: Thu Jun 14, 2007 2:34 pm
by rodre
DSguru2B wrote:Pass your data through aggregator, group by FirstName, LastName and Street. Provide anything for ID, and provide MAX() for all the rest of the columns. Anything will be greater than a space or empty byte or even a null.
That did it...
Thank You very much for your help!!! :D :D

Posted: Thu Jun 14, 2007 2:43 pm
by rodre
gateleys wrote:That's what I would do ... like DSGuru said.

Soundex will be too inclusive. QualityStage, definitely is the right tool. Just wanted to know how you did it.

gateleys
I used this thread to get the duplicates out...

viewtopic.php?t=97213&highlight=ChkDup

Hopes this Helps!! :)

Posted: Thu Jun 14, 2007 3:41 pm
by ray.wurlod
Definitely use QualityStage. This is precisely the kind of thing it does, with the added benefit of configurable levels of uncertainty in the matching phase. You can use NYSIIS and/or Soundex (forward or reverse) or not, as pleases you.