Capture Duplicates

bikan · Post by **bikan** » Tue Mar 06, 2007 7:30 am

Here is the data:
State City
a 1
a 2
a 3
b 4
b 5
c 6

I want remove duplcates (whic i can do using remove duplicate stage). And also want to capture removed duplicates.
i,e,

Output 1:
a 1
b 4
c 6

Output2;
a 2
a 3
a 4
a 5
b 5

Any suggestions.

DSguru2B · Post by **DSguru2B** » Tue Mar 06, 2007 7:43 am

Search for it. It has been discussed time and again with different methods.

us1aslam1us · Post by **us1aslam1us** » Tue Mar 06, 2007 4:12 pm

bikan wrote: Output2:
a 2
a 3
a 4
a 5
b 5
Any suggestions.

On what logic do want the extract that output2??? That doesn't seem to be just duplicates...

ray.wurlod · Post by **ray.wurlod** » Tue Mar 06, 2007 4:54 pm

The most commonly used method, as far as I am aware, is to use a "fork join" - split the data stream into two and, on one of them, count each group. Downstream of that join the count back on to the original rows. You can use the count to determine the duplicates - the rows having the count greater than or equal to two (this might be a Filter stage WHERE clause, for example). To remove the duplicates, use a Remove Duplicates stage against another copy of the data. Make sure that your data are partitioned by the grouping key.

kumar_s · Post by **kumar_s** » Tue Mar 06, 2007 5:08 pm

Or use, Sort stage with KeyChangeColumn and filter those rows with the value 0 for duplicate stream, and 1's for Main stream.

DSXchange

Capture Duplicates

Capture Duplicates

Re: Capture Duplicates