Capturing duplicates



raji33
Premium Member
Posts: 151
Joined: Thu Sep 23, 2010 9:21 pm
Location: NJ


Post by raji33 »

Hi All

I have a scenario where I need to capture duplicate records to a file.

Ex:
column1 column2
1 james
2 james

and also:

column1 column2
1 mary
1 charle

In both scenarios I need to reject both records. How can I implement this logic in DataStage?
Nripendra Chand
Premium Member
Posts: 196
Joined: Tue Nov 23, 2004 11:50 pm
Location: Sydney (Australia)

Post by Nripendra Chand »

Sort on the key columns on which you want to find duplicates using a Sort stage, and set Create Key Change Column to 'Yes'. Then use a Filter or Transformer stage with the following conditions:
1. key change column = 1: put this constraint on the master output link (unique records).
2. key change column = 0: use this constraint to collect the duplicates (see the sketch after this list).
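
A minimal sketch of that keyChange logic in Python, assuming rows arrive already sorted on the key (column2 here, taken from the example above); it only mimics what the Sort stage's key change column computes, it is not DataStage code:

rows = [(1, 'james'), (2, 'james')]

masters, duplicates = [], []
prev_key = None  # first row always gets keyChange = 1
for col1, col2 in rows:
    key_change = 1 if col2 != prev_key else 0
    prev_key = col2
    if key_change == 1:
        masters.append((col1, col2))     # first row of each key group
    else:
        duplicates.append((col1, col2))  # later rows in the same group

print(masters)     # [(1, 'james')]
print(duplicates)  # [(2, 'james')]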
-Nripendra Chand
Tejas Pujari
Participant
Posts: 14
Joined: Thu Jul 10, 2008 7:37 am
Location: mumbai

Post by Tejas Pujari »

Hi,

A single Sort stage will not do here, because two key columns in a Sort stage means the values of both columns are considered together. Every record will get a key change value of 1 unless a record repeats on both key columns at once. The short sketch below shows why.
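
A quick Python illustration of that point, using the poster's sample rows (the loop stands in for the Sort stage's composite-key change detection):

rows = sorted([(1, 'james'), (2, 'james'), (1, 'mary'), (1, 'charle')])

prev = None
for row in rows:
    key = (row[0], row[1])  # composite key: both columns at once
    print(row, 'keyChange =', 1 if key != prev else 0)
    prev = key
# Every row prints keyChange = 1, because no (column1, column2) pair
# repeats -- so the column2 duplicate 'james' is never flagged.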
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

Nripendra Chand wrote: Sort on the key columns on which you want to find duplicates using a Sort stage, and set Create Key Change Column to 'Yes'. Then use a Filter or Transformer stage with the following conditions:
1. key change column = 1: put this constraint on the master output link (unique records).
2. key change column = 0: use this constraint to collect the duplicates.
I guess the poster needs both records to be rejected, so placing key change column = 1 in the constraint won't work.

You need to do a fork join using the keys, with key change column = 0, to achieve the objective (a rough sketch below).
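
A rough Python analogue of that fork join (a Copy feeding an Aggregator, joined back on the key; using column2 as the key is just an assumption from the first example):

from collections import Counter

rows = [(1, 'james'), (2, 'james'), (1, 'mary')]

# Fork 1: count rows per key value, as an Aggregator would.
counts = Counter(col2 for _, col2 in rows)

# Fork 2 joined back to the counts: any row whose key occurs more
# than once is a duplicate, so *both* such rows go to the reject link.
rejects = [r for r in rows if counts[r[1]] > 1]
keeps   = [r for r in rows if counts[r[1]] == 1]

print(rejects)  # [(1, 'james'), (2, 'james')]
print(keeps)    # [(1, 'mary')]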
- Zulfi
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

So, in reality you need to identify and remove all duplicates on column 1, as well as identify and remove all duplicates on column 2.

Which 8.x version of IS are you using? The approach can be different with 8.5 than with 8.0 and 8.1.

Approach this one column at a time.

For 8.0/8.1: Partition/sort on the first column and generate a keyChange column. Use a Transformer to capture the values which have duplicate entries and output those values to a separate stream with a flag value, then join back to the main stream to flag, and then delete, all rows with those values.
Repeat this process for the second column (a rough sketch follows below).
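
A hedged Python sketch of those two passes (a Counter stands in for the keyChange-plus-join machinery, and both columns are flagged against the full input so that every member of a duplicate group is removed, which is how I read the requirement):

from collections import Counter

rows = [(1, 'james'), (2, 'james'), (1, 'mary'), (1, 'charle')]

# Pass 1 flags values duplicated on column1, pass 2 on column2;
# both passes look at the original stream before anything is deleted.
dup1 = {v for v, n in Counter(r[0] for r in rows).items() if n > 1}
dup2 = {v for v, n in Counter(r[1] for r in rows).items() if n > 1}

kept    = [r for r in rows if r[0] not in dup1 and r[1] not in dup2]
rejects = [r for r in rows if r[0] in dup1 or r[1] in dup2]

print(kept)     # [] -- every sample row duplicates on some column
print(rejects)  # all four rows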

For 8.5: The same can be accomplished with a Sort followed by a Transformer using the new looping and key-change recognition features. The first loop iteration can simply count rows for the key and set a flag based on the result, then the second iteration can keep or drop the records as needed (see the sketch below).
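
In plain Python the looping idea looks roughly like this (groupby plays the role of the sorted input plus key-change recognition; this is the shape of the logic, not Transformer syntax):

from itertools import groupby

rows = sorted([(1, 'james'), (2, 'james'), (1, 'mary')],
              key=lambda r: r[1])  # input sorted on the key column

keeps, rejects = [], []
for key, group in groupby(rows, key=lambda r: r[1]):
    group = list(group)            # first loop pass: buffer and count
    target = rejects if len(group) > 1 else keeps
    target.extend(group)           # second pass: emit or drop the rows

print(keeps)    # [(1, 'mary')]
print(rejects)  # [(1, 'james'), (2, 'james')]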

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.