Using Transformer to Capture Duplicate Records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

purush
Participant
Posts: 16
Joined: Sat Nov 25, 2006 5:34 am
Location: Hyderabad

Using Transformer to Capture Duplicate Records

Post by purush »

Hi,

I want to filter the duplicate records and also capture them. I can use a Remove Duplicates stage to remove the duplicates, but it cannot capture the duplicate records. Can I use a Transformer for this? If so, can anyone tell me how to implement the logic?

Purush
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

A transformer can use stage variables to remember values from the previous row. You do this by using one stage variable to perform the comparison and another to hold the comparison value. Because stage variables are evaluated in the order they appear, the comparison runs before the held value is updated, which makes it easy to compare adjacent rows of data.

In this example the first stage variable checks whether the key on the current row matches the key remembered from the previous row:
IsDuplicate: input.key1 : '|' : input.key2 = CompareKey
CompareKey: input.key1 : '|' : input.key2

This only works if you sort your data and partition it by that sort key. It is very fast. When IsDuplicate is true you output the row as a matched duplicate, otherwise you output it as unique.
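The compare-to-previous-row logic can be sketched outside DataStage; here is a minimal Python version of it (column names key1/key2 are taken from the expressions above, the function name is my own, and as with the real job the rows must already be sorted by the key):

```python
def split_duplicates(rows):
    """Yield (row, is_duplicate) pairs for key-sorted rows.

    Mimics the two stage variables: compare_key holds the previous
    row's concatenated key (CompareKey); is_duplicate is the result
    of comparing the current row's key against it (IsDuplicate).
    """
    compare_key = None  # no previous row yet, so the first row is never a duplicate
    for row in rows:
        key = f"{row['key1']}|{row['key2']}"  # concatenated key, as in the expression
        is_duplicate = (key == compare_key)   # compared BEFORE compare_key is updated
        compare_key = key                     # now remember this row's key
        yield row, is_duplicate
```

Rows flagged True would go down the duplicates link, the rest down the unique link.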
csrazdan
Participant
Posts: 127
Joined: Wed May 12, 2004 6:03 pm
Location: Chicago IL

Post by csrazdan »

You can also perform this activity using the SORT stage. Add a SORT stage to your job design and sort on your key. Then set the SORT stage property Create Key Change Column. This property adds a column to your output link whose value is 1 for the first record in each sorted group and 0 otherwise. Downstream, a Transformer can split the records based on the value of this column.
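A rough Python equivalent of the key-change behaviour, assuming the input is already sorted by the key columns (the function name and `keyChange` column name are illustrative; the real property is Create Key Change Column on the Sort stage):

```python
def add_key_change(rows, key_cols):
    """Add a keyChange column: 1 for the first row of each key group, else 0."""
    prev = None
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        row['keyChange'] = 1 if key != prev else 0  # new group starts here?
        prev = key
        yield row
```

Downstream you would route rows with keyChange = 0 to the duplicates output and keyChange = 1 to the unique output.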

Hope it helps........
Last edited by csrazdan on Thu Apr 26, 2007 2:44 pm, edited 1 time in total.
Assume everything I say or do is positive
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Good tip, much better than mine: since you have to sort the data to remove duplicates anyway, you might as well add the key change property when you sort.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can use Remove Duplicates and yet capture the duplicates. Upstream of the Remove Duplicates stage insert a Copy stage. On a second output of the Copy stage insert an Aggregator stage that counts the rows, grouping by the same columns that the Remove Duplicates stage uses, then filter those rows for which the count is 2 or more. These are the duplicates. More efficiently, ignore the Remove Duplicates stage completely, and have a second output from the Filter stage passing those rows for which the count is precisely 1.
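The Copy/Aggregator/Filter pattern amounts to counting rows per key and splitting on the count. A Python sketch of that idea (names are my own, not DataStage syntax; note this splits whole key groups, so a "duplicate" group includes every row sharing the key):

```python
from collections import Counter

def split_by_count(rows, key_cols):
    """Split rows into (uniques, duplicates) by per-key row count.

    The Counter plays the Aggregator's role (count grouped by key);
    the two list comprehensions play the Filter stage's two outputs.
    """
    key_of = lambda r: tuple(r[c] for c in key_cols)
    counts = Counter(key_of(r) for r in rows)          # Aggregator: group and count
    uniques = [r for r in rows if counts[key_of(r)] == 1]  # count precisely 1
    dups = [r for r in rows if counts[key_of(r)] >= 2]     # count 2 or more
    return uniques, dups
```

Unlike the sorted-input techniques above, this does not require the data to be sorted first, though it does hold the counts for every key in memory.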
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.