Capturing duplicates



raji33
Premium Member
Posts: 151
Joined: Thu Sep 23, 2010 9:21 pm
Location: NJ


Post by raji33 »

Hi All

I have a scenario where I need to capture duplicate records to a file.

Ex:
column1 column2
1 james
2 james

and also:

column1 column2
1 mary
1 charle

In both scenarios I need to reject both records. How can I implement this logic in DataStage?
Nripendra Chand
Premium Member
Posts: 196
Joined: Tue Nov 23, 2004 11:50 pm
Location: Sydney (Australia)

Post by Nripendra Chand »

Sort on the key columns on which you want to find duplicates using a Sort stage, and set Create Key Change Column to 'Yes'. Then use a Filter or Transformer stage with the following conditions:
1. key change column = 1: put this constraint on the master output link (unique records).
2. key change column = 0: use this constraint to collect the duplicates (see the sketch after this list).
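
A minimal sketch of that keyChange logic in Python, assuming rows arrive already sorted on the key (column2 here, taken from the example above); it only mimics what the Sort stage's key change column computes, it is not DataStage code:

rows = [(1, 'james'), (2, 'james')]

masters, duplicates = [], []
prev_key = None  # first row always gets keyChange = 1
for col1, col2 in rows:
    key_change = 1 if col2 != prev_key else 0
    prev_key = col2
    if key_change == 1:
        masters.append((col1, col2))     # first row of each key group
    else:
        duplicates.append((col1, col2))  # later rows in the same group

print(masters)     # [(1, 'james')]
print(duplicates)  # [(2, 'james')]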
-Nripendra Chand
Tejas Pujari
Participant
Posts: 14
Joined: Thu Jul 10, 2008 7:37 am
Location: mumbai

Post by Tejas Pujari »

Hi,

A single Sort stage will not do here, because two key columns in a Sort stage means the values of both columns are considered together. Every record will get a key change value of 1 unless a record repeats on both key columns at once. The short sketch below shows why.
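
A quick Python illustration of that point, using the poster's sample rows (the loop stands in for the Sort stage's composite-key change detection):

rows = sorted([(1, 'james'), (2, 'james'), (1, 'mary'), (1, 'charle')])

prev = None
for row in rows:
    key = (row[0], row[1])  # composite key: both columns at once
    print(row, 'keyChange =', 1 if key != prev else 0)
    prev = key
# Every row prints keyChange = 1, because no (column1, column2) pair
# repeats -- so the column2 duplicate 'james' is never flagged.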
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

Nripendra Chand wrote: Sort on the key columns on which you want to find duplicates using a Sort stage, and set Create Key Change Column to 'Yes'. Then use a Filter or Transformer stage with the following conditions:
1. key change column = 1: put this constraint on the master output link (unique records).
2. key change column = 0: use this constraint to collect the duplicates.
I guess the poster needs both records to be rejected, so placing key change column = 1 in the constraint won't work.

You need to do a fork join using the keys, with key change column = 0, to achieve the objective (a rough sketch below).
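
A rough Python analogue of that fork join (a Copy feeding an Aggregator, joined back on the key; using column2 as the key is just an assumption from the first example):

from collections import Counter

rows = [(1, 'james'), (2, 'james'), (1, 'mary')]

# Fork 1: count rows per key value, as an Aggregator would.
counts = Counter(col2 for _, col2 in rows)

# Fork 2 joined back to the counts: any row whose key occurs more
# than once is a duplicate, so *both* such rows go to the reject link.
rejects = [r for r in rows if counts[r[1]] > 1]
keeps   = [r for r in rows if counts[r[1]] == 1]

print(rejects)  # [(1, 'james'), (2, 'james')]
print(keeps)    # [(1, 'mary')]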
- Zulfi
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

So, in reality you need to identify and remove all duplicates on column 1, as well as identify and remove all duplicates on column 2.

Which 8.x version of IS are you using? The approach can be different with 8.5 than with 8.0 and 8.1.

Approach this one column at a time.

For 8.0/8.1: Partition/sort on the first column and generate a keyChange column. Use a Transformer to capture the values which have duplicate entries and output those values to a separate stream with a flag value, then join back to the main stream to flag, and then delete, all rows with those values.
Repeat this process for the second column (a rough sketch follows below).
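
A hedged Python sketch of those two passes (a Counter stands in for the keyChange-plus-join machinery, and both columns are flagged against the full input so that every member of a duplicate group is removed, which is how I read the requirement):

from collections import Counter

rows = [(1, 'james'), (2, 'james'), (1, 'mary'), (1, 'charle')]

# Pass 1 flags values duplicated on column1, pass 2 on column2;
# both passes look at the original stream before anything is deleted.
dup1 = {v for v, n in Counter(r[0] for r in rows).items() if n > 1}
dup2 = {v for v, n in Counter(r[1] for r in rows).items() if n > 1}

kept    = [r for r in rows if r[0] not in dup1 and r[1] not in dup2]
rejects = [r for r in rows if r[0] in dup1 or r[1] in dup2]

print(kept)     # [] -- every sample row duplicates on some column
print(rejects)  # all four rows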

For 8.5: The same can be accomplished with a Sort followed by a Transformer using the new looping and key-change recognition features. The first loop iteration can simply count rows for the key and set a flag based on the result, then the second iteration can keep or drop the records as needed (see the sketch below).
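
In plain Python the looping idea looks roughly like this (groupby plays the role of the sorted input plus key-change recognition; this is the shape of the logic, not Transformer syntax):

from itertools import groupby

rows = sorted([(1, 'james'), (2, 'james'), (1, 'mary')],
              key=lambda r: r[1])  # input sorted on the key column

keeps, rejects = [], []
for key, group in groupby(rows, key=lambda r: r[1]):
    group = list(group)            # first loop pass: buffer and count
    target = rejects if len(group) > 1 else keeps
    target.extend(group)           # second pass: emit or drop the rows

print(keeps)    # [(1, 'mary')]
print(rejects)  # [(1, 'james'), (2, 'james')]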

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.