How to capture duplicate rows separately in DataStage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bikan
Premium Member
Posts: 128
Joined: Thu Jun 08, 2006 5:27 am

How to capture duplicate rows separately in DataStage

Post by bikan »

Hi,

I want to capture the duplicate rows in my input file into a separate file. Could anyone help me with this?

Thanks
Siva
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Fork your data stream into two using a Copy stage. Run one stream through an Aggregator that counts the number of rows for each combination of the duplicate-defining keys. Join the two streams (you now have detail row plus count). Using another Copy stage split the stream into two again. Run one stream through a Remove Duplicates stage, and run the other stream through a Filter stage (or Transformer stage) that blocks rows with a count of 1.
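
Outside DataStage, the same fork/count/join/filter logic can be sketched in a few lines of Python (purely illustrative; the sample rows and the key function are assumptions, not part of any job design):

    from collections import Counter

    # Sample input and duplicate-defining key -- both assumed for illustration
    rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    key = lambda r: r[0]

    # Aggregator equivalent: count rows per key combination
    counts = Counter(key(r) for r in rows)

    # Join equivalent: each detail row now carries its key's count
    joined = [(r, counts[key(r)]) for r in rows]

    # Remove Duplicates equivalent: one row per key
    deduped = list({key(r): r for r, _ in joined}.values())

    # Filter/Transformer equivalent: block rows with a count of 1
    duplicates = [r for r, n in joined if n > 1]

    print(deduped)      # [('a', 3), ('b', 2), ('c', 4)]
    print(duplicates)   # [('a', 1), ('a', 3)]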
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Hi,

If you want to capture the duplicate rows, you can always aggregate the data based on the key and put a filter having count > 1 in the Aggregator. In terms of partitioning the data, I think you can hash-partition on the key.

Sai
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sai, please elaborate on "put a filter in the Aggregator". I do not know how to do this, and would like to learn.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Hi Ray,

Very sorry for the post. Maybe I was half asleep while replying :) I actually meant: get the count of the rows in the Aggregator, and put a filter in the Transformer stage that keeps rows with count > 1.
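
A minimal Python sketch of that corrected flow (the column name "cnt" is an assumption for illustration) -- once the Aggregator count has been joined back onto each row, the Transformer constraint simply keeps rows where the count exceeds 1:

    # Rows as they might look after the Aggregator count is joined back on;
    # the column name "cnt" is assumed for illustration
    joined_rows = [
        {"id": "a", "cnt": 2},
        {"id": "b", "cnt": 1},
        {"id": "a", "cnt": 2},
    ]

    # Transformer-constraint equivalent: keep only rows whose key occurs more than once
    duplicates = [row for row in joined_rows if row["cnt"] > 1]
    print(duplicates)   # the two "a" rows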

Extremely sorry

Sai
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

Search for "Capture duplicate" in this forum, you will find many suggested solutions.
Raamc
Premium Member
Posts: 87
Joined: Mon Aug 20, 2007 9:08 am

Post by Raamc »

Hi Sai,

I had the same requirement a few days back and handled it in the following way, with two jobs:

1. Pass the data to an Aggregator and count the number of records for each key combination. Write the key and count to an intermediate file.

2. Read the intermediate file and filter out the records having Count <= 1 (i.e. keep only the keys that occur more than once), then join with the master input file based on the keys.

Then your final output will contain the duplicate records. A rough sketch of the equivalent logic is below.
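
Here is a short Python sketch of the two jobs (the file names and the key column position are assumptions for illustration):

    import csv
    from collections import Counter

    # Job 1: count records per key and write key + count to an intermediate file
    with open("master.csv", newline="") as f:
        counts = Counter(row[0] for row in csv.reader(f))
    with open("key_counts.csv", "w", newline="") as f:
        csv.writer(f).writerows(counts.items())

    # Job 2: drop keys with Count <= 1, then join back to the master file on the key
    with open("key_counts.csv", newline="") as f:
        dup_keys = {k for k, n in csv.reader(f) if int(n) > 1}
    with open("master.csv", newline="") as f:
        duplicates = [row for row in csv.reader(f) if row[0] in dup_keys]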

These jobs are running in my production. If you have a better idea, let me know.
Thanks,
Raamc