
Capturing duplicates

Posted: Wed May 04, 2011 6:30 pm
by raji33
Hi All

I have a scenario where I need to capture duplicate records to a file.
Ex:
column1 column2
1 james
2 james

Another example:
column1 column2
1 mary
1 charle

In both scenarios I need to reject both records. How can I implement this logic in DataStage?

Posted: Wed May 04, 2011 7:21 pm
by Nripendra Chand
Sort on the key columns on which you want to find duplicates using a Sort stage, and set Create Key Change Column to 'Yes'. Then use a Filter or Transformer stage with the following constraints:
1. key change column = 1: put this constraint on the master output link (unique records)
2. key change column = 0: put this constraint on the link that collects duplicates.
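
For reference, a minimal Python sketch of that key change logic (plain Python rather than DataStage syntax, assuming the input is already sorted on the key, here column2):

Code:
rows = [(1, "james"), (2, "james")]

prev_key = object()                 # sentinel that matches no real key
for col1, col2 in rows:
    key_change = 1 if col2 != prev_key else 0   # 1 = first record of group
    prev_key = col2
    if key_change == 1:
        print("master link:", col1, col2)       # unique / first occurrence
    else:
        print("duplicate link:", col1, col2)    # later duplicates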

Posted: Thu May 05, 2011 1:04 am
by Tejas Pujari
Hi,

A single Sort stage will not do here, because specifying two key columns in a Sort stage means the values of both columns are considered together:
every record will get a key change value of 1 unless a record repeats on both key columns.
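
For example, with the first sample sorted on (column1, column2) together:

column1 column2 keyChange
1 james 1
2 james 1

Neither composite value repeats, so neither row gets flagged as a duplicate.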

Posted: Thu May 05, 2011 1:34 am
by zulfi123786
Nripendra Chand wrote: Sort on the key columns on which you want to find duplicates using a Sort stage, and set Create Key Change Column to 'Yes'. Then use a Filter or Transformer stage with the following constraints:
1. key change column = 1: put this constraint on the master output link (unique records)
2. key change column = 0: put this constraint on the link that collects duplicates.
I guess the poster needs both records to be rejected, hence placing key change column = 1 in the constraint won't work.

You need to do a fork join, joining the stream back to itself on the keys of the records with key change column = 0, to achieve the objective.
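
A rough Python sketch of that fork join idea (plain Python rather than DataStage stages; the third row is added here for illustration):

Code:
from collections import Counter

# Fork: count how often each key value (column2) occurs ...
rows = [(1, "james"), (2, "james"), (3, "mary")]
counts = Counter(col2 for _, col2 in rows)

# ... then join the counts back to the original stream and reject
# every record whose key occurs more than once, i.e. both members
# of a duplicate pair.
unique = [r for r in rows if counts[r[1]] == 1]
rejects = [r for r in rows if counts[r[1]] > 1]
print(unique)    # [(3, 'mary')]
print(rejects)   # [(1, 'james'), (2, 'james')]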

Posted: Thu May 05, 2011 9:48 am
by jwiles
So, in reality you need to identify and remove all duplicates on column1, as well as identify and remove all duplicates on column2.

Which 8.x version of IS are you using? The approach can be different with 8.5 than with 8.0 and 8.1.

Approach this one column at a time.

For 8.0/8.1: Partition and sort on the first column and generate a keyChange column. Use a Transformer to capture the values that have duplicate entries and output those values to a separate stream with a flag value, then join back to the main stream to flag all rows carrying those values and delete them.
Repeat this process for the second column; a sketch of the combined flow follows below.
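
As a plain-Python sketch of that two-pass flow (hypothetical sample data; each pass drops every row whose key value is duplicated):

Code:
from collections import Counter

def drop_dup_keys(rows, key_index):
    # Keep only rows whose key value occurs exactly once.
    counts = Counter(row[key_index] for row in rows)
    return [row for row in rows if counts[row[key_index]] == 1]

rows = [(1, "mary"), (1, "charle"), (2, "james"), (3, "james"), (4, "ann")]
rows = drop_dup_keys(rows, 0)   # pass 1: duplicates on column1
rows = drop_dup_keys(rows, 1)   # pass 2: duplicates on column2
print(rows)                     # [(4, 'ann')]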

For 8.5: The same can be accomplished with a sort followed by a transformer using the new looping and key-change recognition features. The first loop iteration can simply count rows for the key and set a flag based on the result, then the second iteration can keep or drop the records as needed.
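
A minimal sketch of that group-at-a-time logic in Python, with itertools.groupby standing in for the transformer's looping (assumes the input is sorted on the key):

Code:
from itertools import groupby
from operator import itemgetter

rows = sorted([(1, "james"), (2, "james"), (3, "mary")], key=itemgetter(1))

kept = []
for key, group in groupby(rows, key=itemgetter(1)):
    buffered = list(group)      # first pass: buffer the group and count it
    if len(buffered) == 1:      # second pass: keep only non-duplicated keys
        kept.extend(buffered)
print(kept)                     # [(3, 'mary')]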

Regards,