Page 1 of 1

Removing both duplicates

Posted: Thu Sep 10, 2009 7:02 am
by hobocamp
I have a requirement in which if a duplicate value occurs in a particular field, I need to drop both (or all) of the records with that value.

I'm guessing there may be a sql statement that could be devised to handle this, but I'd like to do it without that if possible. Anyone have suggestions?

Thanks in advance.

Tom

Posted: Thu Sep 10, 2009 7:22 am
by Sainath.Srinivasan
Sort the data with cluster change attribute enabled.

Filter only where change is 0 and use it to lookup and ignore duplicates.

Posted: Thu Sep 10, 2009 7:24 am
by ArndW
How about sorting the data on your key criteria, adding a cluster key change column (1 for first duplicate, 0 for subsequent ones), then sort again but add the change column, then a transform stage using a stage variable to store previous record key data. If change column = 1 and previous record is change column 0 and has the same key, drop the record. Drop the record if the change column is 0 as well.

Posted: Thu Sep 10, 2009 9:04 am
by singhald
Hi Tom,

Welcome on board.

I have implemented this solution before. I will suggest you to use two sort stages and one transformer stage to remove both instance incase of duplication.


1st Sort Stage: you need to sort the data on keys on which you are identifying duplicate records & enable KEY CHANGE column in this sort stage.

2nd Sort Stage: use the same fields on which you sorted the data and keep dont sort previously sorted data and sort only KeyChange Field in ascending order.

Transformer Stage: use two stage variables:
1stStage variable--> DupRec:=>If keyChange =0 Or (keyChange=1 And PreKeyChange = 0) Then "DUPE" Else "UNIQUE"

2ndStageVar--->PreKeyChange =>Map it to keychange field

In your transformer Stage, you can put the variable DUPE in to constraint to get all duplicate records and "UNIQUE" to get all unique records.

Let me know if you face any problem.

Re: Removing both duplicates

Posted: Thu Sep 10, 2009 10:43 am
by betterthanever
hobocamp wrote:I have a requirement in which if a duplicate value occurs in a particular field, I need to drop both (or all) of the records with that value.

I'm guessing there may be a sql statement that could be devised to handle this, but I'd like to do it without that if possible. Anyone have suggestions?

Thanks in advance.

Tom