capture duplicates
Posted: Tue Jan 08, 2008 1:16 am
by just4u_sharath
I am removing duplicates from a dataset using the Remove Duplicates stage. My requirement is now to capture the duplicates that are removed and write them to a sequential file. How can I capture those removed duplicates?
Re: capture duplicates
Posted: Tue Jan 08, 2008 2:17 am
by Mayur Dongaonkar
Duplicates can be captured with the following stages:
Dataset ---> Sort (on key columns) ---> Aggregator (group on key columns, count rows) ---> Filter (count > 1) ---> Sequential File
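To see what this flow produces, here is a minimal Python sketch of the same sort/count/filter logic. It is not DataStage code: the function name `duplicated_keys`, the key column `id`, and the sample rows are made up for illustration. Note that counting per key yields one row per duplicated key with its count, not the duplicate records themselves.

```python
from collections import Counter

def duplicated_keys(records, key_columns):
    """Count rows per key and keep only keys seen more than once.

    Mirrors the Aggregator (count per key) + Filter (count > 1) steps.
    """
    counts = Counter(tuple(rec[c] for c in key_columns) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical sample data; "id" is an assumed key column.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
    {"id": 3, "name": "d"},
    {"id": 1, "name": "e"},
]

dupes = duplicated_keys(rows, ["id"])
print(dupes)  # {(1,): 3}
```

If the full duplicate records are needed rather than just the keys, the counts would have to be joined back to the input, or a key-change flag used instead.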
Posted: Tue Jan 08, 2008 2:39 am
by Maveric
Set the "Create Cluster Key Change Column" property in sort stage to true. This creates the output field "clusterKeyChange". The values in this field will be 1 for a record, and 0 for all its duplicate records. Using the filter stage you can get the duplicates in one link and unique records in one link by applying the filter condition on "clusterKeyChange" field.
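The key-change idea can be illustrated with a short Python sketch. This is only a model of the behaviour described above, not DataStage code: the function `split_by_key_change`, the flag name `key_change` (standing in for the stage's generated column), and the sample rows are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

def split_by_key_change(records, key_columns):
    """Sort by key, flag the first row of each key group with 1 and the
    rest with 0, then route rows to two lists the way a Filter stage
    would route them to two output links."""
    key = itemgetter(*key_columns)
    uniques, duplicates = [], []
    # sorted() is stable, so the first occurrence of each key stays first.
    for _, group in groupby(sorted(records, key=key), key=key):
        for i, rec in enumerate(group):
            flagged = dict(rec, key_change=1 if i == 0 else 0)
            (uniques if i == 0 else duplicates).append(flagged)
    return uniques, duplicates

# Hypothetical sample data; "id" is an assumed key column.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
    {"id": 3, "name": "d"},
    {"id": 1, "name": "e"},
]

uniques, duplicates = split_by_key_change(rows, ["id"])
print([r["name"] for r in uniques])     # ['a', 'b', 'd']
print([r["name"] for r in duplicates])  # ['c', 'e']
```

The `duplicates` list corresponds to the link that would feed the sequential file.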
Posted: Mon Mar 03, 2008 6:37 am
by Das
Maveric wrote: Set the "Create Cluster Key Change Column" property in sort stage to true. This creates the output field "clusterKeyChange". The values in this field will be 1 for a record, and 0 for all its duplicate records. Using the filter stage you can get the duplicates in one link and unique records in one link by applying the filter condition on "clusterKeyChange" field.
That works, but I have a doubt: why do we need clusterKeyChange? Isn't this possible with keyChange? I have used keyChange on many occasions. Can anybody explain the scenario in which we would use clusterKeyChange, and what the difference is?
Thanks in advance.
Posted: Thu Jan 29, 2009 3:52 am
by yousuff1710
You are right. The keyChange option is used when the sort key mode is Sort. clusterKeyChange is used when the sort key mode is Don't Sort (Previously Sorted).
Posted: Fri Jan 30, 2009 6:27 am
by keshav0307
This has been discussed many times. Try searching for "capture duplicate".