How to capture duplicate rows separately in DataStage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bikan
Premium Member
Posts: 128
Joined: Thu Jun 08, 2006 5:27 am

How to capture duplicate rows separately in DataStage

Post by bikan »

Hi,

I want to capture the duplicate rows in my input file into a separate file. Could anyone help me with this?

Thanks
Siva
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Fork your data stream into two using a Copy stage. Run one stream through an Aggregator that counts the number of rows for each combination of the duplicate-defining keys. Join the two streams (you now have detail row plus count). Using another Copy stage split the stream into two again. Run one stream through a Remove Duplicates stage, and run the other stream through a Filter stage (or Transformer stage) that blocks rows with a count of 1.
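
Outside DataStage, the same fork/count/join/filter logic can be sketched in a few lines of Python (purely illustrative; the sample rows and the key function are assumptions, not part of any job design):

    from collections import Counter

    # Sample input and duplicate-defining key -- both assumed for illustration
    rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    key = lambda r: r[0]

    # Aggregator equivalent: count rows per key combination
    counts = Counter(key(r) for r in rows)

    # Join equivalent: each detail row now carries its key's count
    joined = [(r, counts[key(r)]) for r in rows]

    # Remove Duplicates equivalent: one row per key
    deduped = list({key(r): r for r, _ in joined}.values())

    # Filter/Transformer equivalent: block rows with a count of 1
    duplicates = [r for r, n in joined if n > 1]

    print(deduped)      # [('a', 3), ('b', 2), ('c', 4)]
    print(duplicates)   # [('a', 1), ('a', 3)]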
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Hi,

If you want to capture the duplicate rows, you can always aggregate the data based on the key and put a filter having count > 1 in the Aggregator. In terms of partitioning the data, I think you can hash-partition on the key.

Sai
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sai, please elaborate on "put a filter in the Aggregator". I do not know how to do this, and would like to learn.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Hi Ray,

Very sorry for the post. Maybe I was half asleep while replying :) I actually meant: get the count of the rows in the Aggregator, and put a filter in the Transformer stage that keeps rows with count > 1.
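
A minimal Python sketch of that corrected flow (the column name "cnt" is an assumption for illustration) -- once the Aggregator count has been joined back onto each row, the Transformer constraint simply keeps rows where the count exceeds 1:

    # Rows as they might look after the Aggregator count is joined back on;
    # the column name "cnt" is assumed for illustration
    joined_rows = [
        {"id": "a", "cnt": 2},
        {"id": "b", "cnt": 1},
        {"id": "a", "cnt": 2},
    ]

    # Transformer-constraint equivalent: keep only rows whose key occurs more than once
    duplicates = [row for row in joined_rows if row["cnt"] > 1]
    print(duplicates)   # the two "a" rows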

Extremely sorry

Sai
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

Search for "Capture duplicate" in this forum, you will find many suggested solutions.
Raamc
Premium Member
Posts: 87
Joined: Mon Aug 20, 2007 9:08 am

Post by Raamc »

Hi Sai,

I had the same requirement a few days back and handled it in the following way, with two jobs:

1. Pass the data to an Aggregator and count the number of records for each key combination. Write the key and count to an intermediate file.

2. Read the intermediate file and filter out the records having Count <= 1 (i.e. keep only the keys that occur more than once), then join with the master input file based on the keys.

Then your final output will contain the duplicate records. A rough sketch of the equivalent logic is below.
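
Here is a short Python sketch of the two jobs (the file names and the key column position are assumptions for illustration):

    import csv
    from collections import Counter

    # Job 1: count records per key and write key + count to an intermediate file
    with open("master.csv", newline="") as f:
        counts = Counter(row[0] for row in csv.reader(f))
    with open("key_counts.csv", "w", newline="") as f:
        csv.writer(f).writerows(counts.items())

    # Job 2: drop keys with Count <= 1, then join back to the master file on the key
    with open("key_counts.csv", newline="") as f:
        dup_keys = {k for k, n in csv.reader(f) if int(n) > 1}
    with open("master.csv", newline="") as f:
        duplicates = [row for row in csv.reader(f) if row[0] in dup_keys]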

These jobs are running in my production. If you have a better idea, let me know.
Thanks,
Raamc