Page 1 of 1

Capture Duplicates

Posted: Tue Mar 06, 2007 7:30 am
by bikan
Here is the data:
State City
a 1
a 2
a 3
b 4
b 5
c 6

I want remove duplcates (whic i can do using remove duplicate stage). And also want to capture removed duplicates.
i,e,

Output 1:
a 1
b 4
c 6

Output2;
a 2
a 3
a 4
a 5
b 5

Any suggestions.

Posted: Tue Mar 06, 2007 7:43 am
by DSguru2B
Search for it. It has been discussed time and again with different methods.

Re: Capture Duplicates

Posted: Tue Mar 06, 2007 4:12 pm
by us1aslam1us
bikan wrote: Output2:
a 2
a 3
a 4
a 5
b 5
Any suggestions.
On what logic do want the extract that output2??? That doesn't seem to be just duplicates...

Posted: Tue Mar 06, 2007 4:54 pm
by ray.wurlod
The most commonly used method, as far as I am aware, is to use a "fork join" - split the data stream into two and, on one of them, count each group. Downstream of that join the count back on to the original rows. You can use the count to determine the duplicates - the rows having the count greater than or equal to two (this might be a Filter stage WHERE clause, for example). To remove the duplicates, use a Remove Duplicates stage against another copy of the data. Make sure that your data are partitioned by the grouping key.

Posted: Tue Mar 06, 2007 5:08 pm
by kumar_s
Or use, Sort stage with KeyChangeColumn and filter those rows with the value 0 for duplicate stream, and 1's for Main stream.