Capture Duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
bikan
Premium Member
Premium Member
Posts: 128
Joined: Thu Jun 08, 2006 5:27 am

Capture Duplicates

Post by bikan »

Here is the data:
State City
a 1
a 2
a 3
b 4
b 5
c 6

I want remove duplcates (whic i can do using remove duplicate stage). And also want to capture removed duplicates.
i,e,

Output 1:
a 1
b 4
c 6

Output2;
a 2
a 3
a 4
a 5
b 5

Any suggestions.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Search for it. It has been discussed time and again with different methods.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
us1aslam1us
Charter Member
Charter Member
Posts: 822
Joined: Sat Sep 17, 2005 5:25 pm
Location: USA

Re: Capture Duplicates

Post by us1aslam1us »

bikan wrote: Output2:
a 2
a 3
a 4
a 5
b 5
Any suggestions.
On what logic do want the extract that output2??? That doesn't seem to be just duplicates...
I haven't failed, I've found 10,000 ways that don't work.
Thomas Alva Edison(1847-1931)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The most commonly used method, as far as I am aware, is to use a "fork join" - split the data stream into two and, on one of them, count each group. Downstream of that join the count back on to the original rows. You can use the count to determine the duplicates - the rows having the count greater than or equal to two (this might be a Filter stage WHERE clause, for example). To remove the duplicates, use a Remove Duplicates stage against another copy of the data. Make sure that your data are partitioned by the grouping key.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Or use, Sort stage with KeyChangeColumn and filter those rows with the value 0 for duplicate stream, and 1's for Main stream.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Post Reply