Page 1 of 1

Storing duplicate records

Posted: Thu Aug 17, 2006 1:17 am
by aaikat
Using 'Remove Duplicate' stage we can remove the duplicates. Now since we can not have Reject link with that stage so how to collect (and store them in a file) the duplicate records that are being rejected.

Posted: Thu Aug 17, 2006 1:32 am
by Daddy Doma
You can create a "Manual" Remove Duplicates by using a Transformer.

Ensure your data is sorted and partitioned on the grouping keys before entering the stage. Then, use stage variables to assess each new attribute against the preceeding rows.
  • - Create a stage variable called NewID and set as current row ID.
    - Evaluate OldID against NewID.
    - Create a stage variable called OldID and set as current row ID.
This is possible because stage variables are processes in sequence. When you get to row 2, the key from row 1 is still in the OldID field, so it can compare them.

Posted: Thu Aug 17, 2006 1:35 am
by Daddy Doma
Oops! Forgot to say: The value of your assessment stage variable sends the rows to different output links from the transformer. If the OldID and NewID match, send it as a duplicate. If they don't match, then it is the first record and any other duplicate records will follow it.

Posted: Thu Aug 17, 2006 4:40 am
by kumar_s
This has been discussed many times. One other option, Sort and Enable the KeyChange option. And in transformer, you can track the value of this field and filter based on it.
Or Remove the duplicate, and find a difference with the original file using Difference stage.
And there are many options too.