
How to capture duplicate rows separately in DataStage

Posted: Wed Aug 13, 2008 11:59 pm
by bikan
Hi,

I want to capture the duplicate rows in my input file into a separate file. Could anyone help me with this?

Thanks
Siva

Posted: Thu Aug 14, 2008 12:23 am
by ray.wurlod
Fork your data stream into two using a Copy stage:

1. Run one stream through an Aggregator that counts the number of rows for each combination of the duplicate-defining keys.

2. Join the two streams (you now have each detail row plus its count).

3. Using another Copy stage, split the stream into two again.

4. Run one stream through a Remove Duplicates stage, and run the other stream through a Filter stage (or Transformer stage) that blocks rows with a count of 1.
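
In pseudo-Python, the dataflow logic looks roughly like this (the sample rows and the cust_id key are invented for illustration, not taken from Siva's file):

Code:

from collections import Counter

# Hypothetical rows and duplicate-defining key (cust_id);
# adjust both to your actual record layout.
rows = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 20},
    {"cust_id": 1, "amount": 30},
]

# Aggregator: count rows per key combination.
counts = Counter(r["cust_id"] for r in rows)

# After joining the count back onto each detail row, split the stream:
# Remove Duplicates keeps one row per key, while the Filter/Transformer
# blocks rows whose count is 1.
deduped, duplicates = [], []
seen = set()
for r in rows:
    k = r["cust_id"]
    if k not in seen:          # Remove Duplicates stage
        seen.add(k)
        deduped.append(r)
    if counts[k] > 1:          # Filter stage: count of 1 is blocked
        duplicates.append(r)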

Posted: Thu Aug 14, 2008 12:24 am
by saikir
Hi,

If you want to capture the duplicate rows, you can always aggregate the data based on the key and put a filter having count > 1 in the Aggregator. In terms of partitioning the data, I think you can hash-partition on the key.

Sai

Posted: Thu Aug 14, 2008 2:35 am
by ray.wurlod
Sai, please elaborate on "put a filter in the Aggregator". I do not know how to do this, and would like to learn.

Posted: Thu Aug 14, 2008 2:42 am
by saikir
Hi Ray,

Very sorry for the post. I must have been half asleep while replying :) I actually meant: get the count of the rows in the Aggregator, then put a filter in the Transformer stage having count > 1.

Extremely sorry

Sai
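
As a short illustration of that corrected approach (Python, with a made-up cust_id key), attach a per-key count to each row and let through only rows whose count exceeds 1:

Code:

from collections import Counter

# Hypothetical input; the key column is cust_id.
rows = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 1}]

# Aggregator: per-key row counts.
counts = Counter(r["cust_id"] for r in rows)

# Transformer constraint: pass only rows whose key occurs more than once.
duplicates = [r for r in rows if counts[r["cust_id"]] > 1]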

Posted: Thu Aug 14, 2008 3:32 am
by keshav0307
Search for "Capture duplicate" in this forum; you will find many suggested solutions.

Posted: Thu Aug 14, 2008 1:35 pm
by Raamc
Hi Sai,

I had the same requirement a few days back and handled it the following way, using two jobs:

1. Pass the data to an Aggregator and count the number of records for each key combination. Write the key and count to an intermediate file.

2. Read the intermediate file, filter out the records having Count <= 1 (keeping only keys that occur more than once), then join with the master input file based on the keys.

Your final output will then contain the duplicate records.

These jobs are running in my production environment. If you get a better idea, let me know.
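
A rough Python equivalent of that two-job design (the file names and the cust_id key column are made up for illustration):

Code:

import csv
from collections import Counter

# Hypothetical file names; adjust to your job design.
MASTER = "master_input.csv"
INTERMEDIATE = "key_counts.csv"

# Job 1: Aggregator -- count records per key combination and write
# the key and count to an intermediate file.
with open(MASTER, newline="") as f:
    counts = Counter(row["cust_id"] for row in csv.DictReader(f))
with open(INTERMEDIATE, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["cust_id", "count"])
    for k, c in counts.items():
        w.writerow([k, c])

# Job 2: keep only keys with count > 1, then join back to the master
# file on the key; the matching rows are the duplicates.
with open(INTERMEDIATE, newline="") as f:
    dup_keys = {row["cust_id"] for row in csv.DictReader(f)
                if int(row["count"]) > 1}
with open(MASTER, newline="") as f:
    duplicates = [row for row in csv.DictReader(f)
                  if row["cust_id"] in dup_keys]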
