Duplicate Records question

rodre · Post by **rodre** » Thu Jun 14, 2007 12:49 pm

Dealing with duplicate records....

I have a question that perhaps someone can help me:

I was able to extract the duplicates from a file bases on name and street. Now I need to extract some information that might be available in some of the duplicate records.
Here is an example:

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
1       John          Doe       1 Main    e@yahoo.com                   
2       John          Doe       1 Main                      345-2546 
3       John          Doe       1 Main

The end result should be:

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
2(Any)  John          Doe       1 Main    e@yahoo.com         345-2546

Basically one record with any available info in email or phone.

Does anyone have any suggestions on how to do this...

Thank you in advance for your help!!

DSguru2B · Post by **DSguru2B** » Thu Jun 14, 2007 1:28 pm

Pass your data through aggregator, group by FirstName, LastName and Street. Provide anything for ID, and provide MAX() for all the rest of the columns. Anything will be greater than a space or empty byte or even a null.

47shailesh · Post by **47shailesh** » Thu Jun 14, 2007 1:34 pm

If you are using select query than use NOT NULL constraint on each column that you are selecting

gateleys · Post by **gateleys** » Thu Jun 14, 2007 1:35 pm

That's what I would do ... like DSGuru said.

Hey rodre, just wanted to know how you performed cleansing of the names and street information. For example, if your rows came in as -

Code: Select all

Id     FirstName    LastName        Street      Email         Phone    
1       Jon          Doe           1 Main    e@yahoo.com                    
2       John          Do           1 Main                      345-2546 
3       John          Doe         1 Main

Soundex will be too inclusive. QualityStage, definitely is the right tool. Just wanted to know how you did it.

gateleys

gateleys · Post by **gateleys** » Thu Jun 14, 2007 1:37 pm

47shailesh wrote:If you are using select query than use NOT NULL constraint on each column that you are selecting

What?????

Can you be more elaborate?

gateleys

rodre · Post by **rodre** » Thu Jun 14, 2007 2:34 pm

DSguru2B wrote:Pass your data through aggregator, group by FirstName, LastName and Street. Provide anything for ID, and provide MAX() for all the rest of the columns. Anything will be greater than a space or empty byte or even a null.

That did it...
Thank You very much for your help!!! :D :D

rodre · Post by **rodre** » Thu Jun 14, 2007 2:43 pm

gateleys wrote:That's what I would do ... like DSGuru said.

Soundex will be too inclusive. QualityStage, definitely is the right tool. Just wanted to know how you did it.

gateleys

I used this thread to get the duplicates out...

viewtopic.php?t=97213&highlight=ChkDup

Hopes this Helps!!

ray.wurlod · Post by **ray.wurlod** » Thu Jun 14, 2007 3:41 pm

Definitely use QualityStage. This is precisely the kind of thing it does, with the added benefit of configurable levels of uncertainty in the matching phase. You can use NYSIIS and/or Soundex (forward or reverse) or not, as pleases you.