Eliminate duplicates from file and capture in flatfile

Posted: Fri Jul 24, 2009 12:29 pm
by Jagan617
Can anyone assist me in capturing all the duplicate records coming from a CSV file into a separate file, without using the Aggregator stage?

Re: Eliminate duplicates from file and capture in flatfile

Posted: Fri Jul 24, 2009 12:36 pm
by ddevdutt
You should be able to achieve this using stage variables in the Transformer.

Posted: Fri Jul 24, 2009 12:45 pm
by ArndW
Stage variables will work but you would also need to sort the data.

Posted: Fri Jul 24, 2009 12:47 pm
by Sainath.Srinivasan
Please define duplicates.

Are you looking for identical rows or duplicates by specific columns?

Use sort stage and filter

Posted: Fri Jul 24, 2009 1:28 pm
by ssbhas
I think the best way to capture duplicates is by using "Sort Stage".

Define the columns on which you want to find duplicates as your sort keys and enable "Create Key Change Column". This creates an additional column, "keyChange". All rows with a value of '0' (zero) in "keyChange" are duplicates.

P.S.: Make sure you hash partition on key columns.
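
For clarity, here is a minimal sketch of the logic that option implements. This is plain Python, not DataStage (the job itself is configured in the GUI); the key columns "cust_id"/"order_dt" and the file names are assumptions for illustration:

Code:

# A minimal Python sketch of the logic behind the Sort stage's
# "Create Key Change Column" option -- not DataStage code. The key
# columns and file names below are assumptions.
import csv

KEY_COLS = ["cust_id", "order_dt"]   # assumed duplicate-detection keys

with open("input.csv", newline="") as src:
    rows = sorted(csv.DictReader(src),
                  key=lambda r: [r[c] for c in KEY_COLS])

prev_key = None
for row in rows:
    key = [row[c] for c in KEY_COLS]
    # keyChange = 1 for the first row of each key group, 0 for repeats
    row["keyChange"] = "1" if key != prev_key else "0"
    prev_key = key

# Route rows with keyChange == 0 (the duplicates) to a separate file
if rows:
    for name, flag in (("uniques.csv", "1"), ("duplicates.csv", "0")):
        subset = [r for r in rows if r["keyChange"] == flag]
        with open(name, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(subset)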

Re: Use sort stage and filter

Posted: Fri Jul 24, 2009 2:25 pm
by ddevdutt
ssbhas wrote: P.S.: Make sure you hash partition on key columns.

DataStage Server Edition is being used, so hash partitioning does not apply :D

Eliminate duplicates from file and capture in flatfile

Posted: Fri Jul 24, 2009 7:41 pm
by Jagan617
I am looking for duplicates by specific columns.

Eliminate duplicates from file and capture in flatfile

Posted: Fri Jul 24, 2009 7:46 pm
by Jagan617
ArndW wrote: Stage variables will work but you would also need to sort the data. ...

Can you please explain the approach in the Transformer using stage variables once the data has been sorted?

Posted: Sat Jul 25, 2009 12:50 am
by ArndW
You would need to answer Srini's question in order to get a good answer. Basically, stage variables are used to store values from the previous row and compare them to the current row. You would compare the columns on which you wish to detect duplicates and, using constraints, skip the duplicate rows. Again, you would need to sort the data so that duplicates can be detected.
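
To make that concrete, here is a minimal Python sketch of the pattern, not actual DataStage code: prev_key plays the role of the stage variable, and the if/else plays the role of the output-link constraints. The key columns and file names are assumptions for illustration.

Code:

# A minimal Python sketch of the Transformer stage-variable pattern --
# not actual DataStage code. prev_key acts as the stage variable; the
# if/else acts as the output-link constraints. Key columns and file
# names are assumptions.
import csv

KEY_COLS = ["cust_id", "order_dt"]   # assumed duplicate-detection keys

with open("sorted_input.csv", newline="") as src, \
     open("deduped.csv", "w", newline="") as good, \
     open("duplicates.csv", "w", newline="") as dupes:
    reader = csv.DictReader(src)
    out_good = csv.DictWriter(good, fieldnames=reader.fieldnames)
    out_dupes = csv.DictWriter(dupes, fieldnames=reader.fieldnames)
    out_good.writeheader()
    out_dupes.writeheader()

    prev_key = None                      # the "stage variable"
    for row in reader:                   # input must already be sorted on the key
        key = tuple(row[c] for c in KEY_COLS)
        if key == prev_key:
            out_dupes.writerow(row)      # constraint: same key as previous row
        else:
            out_good.writerow(row)       # constraint: key changed, keep the row
        prev_key = key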