
capture duplicates

Posted: Tue Jan 08, 2008 1:16 am
by just4u_sharath
I am removing duplicates from a dataset using the Remove Duplicates stage. Now my requirement is to capture the duplicates that are removed and write them to a sequential file. How can I capture those removed duplicates?

Re: capture duplicates

Posted: Tue Jan 08, 2008 2:17 am
by Mayur Dongaonkar
Duplicates can be captured with the following stages:

Dataset ----> sort ( on key columns ) ---> aggregator ( on key columns + count operation ) ---> filter ( count > 1 ) ---> sequential file
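To make the logic of that flow concrete, here is a minimal Python sketch of the same sort, count, and filter steps. It is only an illustration of the idea, not DataStage code: the function name `duplicate_key_counts`, the column names, and the sample rows are all hypothetical. Note that, like the aggregator approach, it yields the duplicated key values and their counts rather than the full duplicate rows.

```python
from itertools import groupby


def duplicate_key_counts(rows, key_cols):
    """Return (key, count) pairs for every key that occurs more than once."""
    def key_of(row):
        return tuple(row[c] for c in key_cols)

    # "sort" step: order the rows by the key columns
    ordered = sorted(rows, key=key_of)
    # "aggregator" step: count the rows in each key group
    counts = [(key, sum(1 for _ in group))
              for key, group in groupby(ordered, key=key_of)]
    # "filter" step: keep only keys seen more than once
    return [(key, n) for key, n in counts if n > 1]


# Hypothetical sample data: id 1 appears twice
sample_rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
]
dupes = duplicate_key_counts(sample_rows, ["id"])
```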

Posted: Tue Jan 08, 2008 2:39 am
by Maveric
Set the "Create Cluster Key Change Column" property in the Sort stage to true. This creates the output field "clusterKeyChange". The value in this field will be 1 for the first record of each key group and 0 for all of its duplicate records. Using the Filter stage, you can send the duplicates down one link and the unique records down another by applying the filter condition on the "clusterKeyChange" field.
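As a rough illustration of what the key change flag and the filter are doing, here is a small Python sketch. It assumes the input is already sorted on the key columns; the function name `split_on_key_change` and the sample rows are hypothetical, and this is not how DataStage implements the stage internally.

```python
def split_on_key_change(sorted_rows, key_cols):
    """Split rows sorted on key_cols into (uniques, duplicates).

    The first row of each key group gets key_change = 1 and goes to
    uniques; every later row in the group gets 0 and goes to duplicates.
    """
    uniques, duplicates = [], []
    previous_key = object()  # sentinel that never equals a real key
    for row in sorted_rows:
        key = tuple(row[c] for c in key_cols)
        key_change = 1 if key != previous_key else 0
        if key_change == 1:
            uniques.append(row)
        else:
            duplicates.append(row)
        previous_key = key
    return uniques, duplicates


# Hypothetical sample data, already sorted on "id"
sample_sorted = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "c"},
    {"id": 2, "name": "b"},
]
unique_rows, duplicate_rows = split_on_key_change(sample_sorted, ["id"])
```

Unlike the count-based approach, this keeps the full duplicate rows, which is what you would write to the sequential file.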

Posted: Mon Mar 03, 2008 6:37 am
by Das
Maveric wrote: Set the "Create Cluster Key Change Column" property in sort stage to true. This creates the output field "clusterKeyChange". The values in this field will be 1 for a record, and 0 for all its duplicate records. Using the filter stage you can get the duplicates in one link and unique records in one link by applying the filter condition on "clusterKeyChange" field.
That's OK, but I have a doubt: why do we need to use clusterKeyChange? Isn't this possible with keyChange? I have used key change on many occasions. Can anybody explain the scenario in which we would use clusterKeyChange, and what the difference is?

Thanks in advance

Posted: Thu Jan 29, 2009 3:52 am
by yousuff1710
You are right. The keyChange option is used when the sort mode is "Sort". clusterKeyChange is used when the sort mode is "Don't Sort (Previously Sorted)".

Posted: Fri Jan 30, 2009 6:27 am
by keshav0307
This has been discussed so many times... just try a search for "capture duplicate".