Capturing duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

adityavarma
Premium Member
Posts: 104
Joined: Thu Jul 12, 2007 11:32 pm
Location: Canada

Capturing duplicates

Post by adityavarma »

Hi,
I have a requirement where, if my source file contains duplicates, I need to capture all of the duplicate records and load them into another table.

For example:
101,test,austraila
101,test,austraila
202,test1,india

I need to capture both of the first two records and load them into a table.

I have done it with the Sort and Filter stages, but with that approach one record goes down one link and the other goes down another link.

Can anyone please suggest how to proceed with this?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Depending upon what you want to do there are several stages and methods available to you.

It is not quite clear from your description what you want to achieve. In your example, the "101" row is duplicated. Do you want both records to go down one link, or the first to go down one link and subsequent duplicates to go down another path?
adityavarma
Premium Member
Posts: 104
Joined: Thu Jul 12, 2007 11:32 pm
Location: Canada

Post by adityavarma »

ArndW,

I want both of the (101) records to go down one link.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Assuming you are looking to extract records with duplicate ID values, you can do this by forking your data stream: send one branch into an Aggregator, then join it back into the main stream to pick out the IDs with count > 1.
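As an illustration only (outside DataStage), the fork-join idea can be sketched with a two-pass awk over the sample data from the original post; the file name src.txt is hypothetical:

```shell
# Sample data from the post (hypothetical file name src.txt).
printf '101,test,austraila\n101,test,austraila\n202,test1,india\n' > src.txt

# Pass 1 (the "aggregator" fork): count rows per ID in the first read.
# Pass 2 (the join back): emit only rows whose ID count is greater than 1.
awk -F, 'NR==FNR {cnt[$1]++; next} cnt[$1] > 1' src.txt src.txt
```

Both 101 rows come out together, which is the "all duplicates down one link" behaviour being asked for; changing the condition to cnt[$1] == 1 gives the unique stream instead.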

If the volume is low, I suggest running the Unix uniq command to obtain these counts.
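For the low-volume case, a sketch of that uniq approach on the sample data (the file name dups.txt is hypothetical, and -D is a GNU uniq option):

```shell
# Sample data from the post (hypothetical file name dups.txt).
printf '101,test,austraila\n101,test,austraila\n202,test1,india\n' > dups.txt

# uniq needs sorted input; -c prefixes each distinct row with its count.
sort dups.txt | uniq -c

# With GNU uniq, -D prints every member of each duplicated group.
sort dups.txt | uniq -D
```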
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

You can use an Aggregator stage here... take a count of the key column; if the count is greater than 1, route the record to an output via a Filter.
ds_dwh
Participant
Posts: 39
Joined: Fri May 14, 2010 6:06 am

Re: Capturing duplicates

Post by ds_dwh »

Hi,

Use a Sort stage and set properties such as Allow Duplicates = True and
Create Key Change Column = True.

In the Filter stage you can then write conditions like:
where keyChange = 0 ----> to capture duplicate records

where keyChange = 1 ----> to capture unique records
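For illustration only, the key-change logic on sorted input can be sketched in awk as a stand-in for the Sort stage's keyChange column (the file name sorted.txt is hypothetical):

```shell
# Sorted sample data from the thread (hypothetical file name sorted.txt).
printf '101,test,austraila\n101,test,austraila\n202,test1,india\n' > sorted.txt

# keyChange = 1 on the first row of each key, 0 on subsequent rows of that key.
awk -F, '{kc = ($1 != prev) ? 1 : 0; prev = $1; print kc "," $0}' sorted.txt
```

Note that filtering on keyChange = 0 captures only the second and later rows of each duplicate group; the first row of the group still carries keyChange = 1, so this alone does not capture *all* duplicates.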
ANJI
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You missed the "capture all duplicates" part of the requirement, hence the other suggestions for a fork-join design.
-craig

"You can never have too many knives" -- Logan Nine Fingers
adityavarma
Premium Member
Posts: 104
Joined: Thu Jul 12, 2007 11:32 pm
Location: Canada

Post by adityavarma »

Thank you all for your responses.

The requirement is that I want to capture all of the duplicates into one link.

Example:
10001 sainath
10001 andrw
10002 aditya
10003 dsdwh

I want both of the 10001 records (sainath and andrw) to be sent to one link, and the 10002 and 10003 records to another link.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

We know... and you've been given the solution for that.
-craig

"You can never have too many knives" -- Logan Nine Fingers