Using Transformer to Capture Duplicate Records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

purush
Participant
Posts: 16
Joined: Sat Nov 25, 2006 5:34 am
Location: Hyderabad

Using Transformer to Capture Duplicate Records

Post by purush »

Hi,

I want to filter the duplicate records and also capture them. I can use a Remove Duplicates stage to remove the duplicates, but it cannot capture the duplicate records. Can I use a Transformer for this? If so, can anyone tell me how to implement the logic?

Purush
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

A transformer can use stage variables to remember values from the previous row. You do this by using one stage variable to perform the comparison and another to hold the comparison value. Because stage variables are evaluated in the order they appear, the comparison runs before the held value is updated, which makes it easy to compare adjacent rows of data.

In this example the first stage variable checks whether the key on the current row matches the key remembered from the previous row:
IsDuplicate: input.key1 : '|' : input.key2 = CompareKey
CompareKey: input.key1 : '|' : input.key2

This only works if you sort your data and partition it by that sort key. It is very fast. When IsDuplicate is true you output the row as a matched duplicate, otherwise you output it as unique.
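The compare-to-previous-row logic can be sketched outside DataStage; here is a minimal Python version of it (column names key1/key2 are taken from the expressions above, the function name is my own, and as with the real job the rows must already be sorted by the key):

```python
def split_duplicates(rows):
    """Yield (row, is_duplicate) pairs for key-sorted rows.

    Mimics the two stage variables: compare_key holds the previous
    row's concatenated key (CompareKey); is_duplicate is the result
    of comparing the current row's key against it (IsDuplicate).
    """
    compare_key = None  # no previous row yet, so the first row is never a duplicate
    for row in rows:
        key = f"{row['key1']}|{row['key2']}"  # concatenated key, as in the expression
        is_duplicate = (key == compare_key)   # compared BEFORE compare_key is updated
        compare_key = key                     # now remember this row's key
        yield row, is_duplicate
```

Rows flagged True would go down the duplicates link, the rest down the unique link.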
csrazdan
Participant
Posts: 127
Joined: Wed May 12, 2004 6:03 pm
Location: Chicago IL

Post by csrazdan »

You can also perform this activity using the SORT stage. Add a SORT stage to your job design and sort on your key. Then set the SORT stage property Create Key Change Column. This property adds a column to your output link whose value is 1 for the first record in each sorted group and 0 otherwise. Downstream, a Transformer can split the records based on the value of this column.
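A rough Python equivalent of the key-change behaviour, assuming the input is already sorted by the key columns (the function name and `keyChange` column name are illustrative; the real property is Create Key Change Column on the Sort stage):

```python
def add_key_change(rows, key_cols):
    """Add a keyChange column: 1 for the first row of each key group, else 0."""
    prev = None
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        row['keyChange'] = 1 if key != prev else 0  # new group starts here?
        prev = key
        yield row
```

Downstream you would route rows with keyChange = 0 to the duplicates output and keyChange = 1 to the unique output.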

Hope it helps........
Last edited by csrazdan on Thu Apr 26, 2007 2:44 pm, edited 1 time in total.
Assume everything I say or do is positive
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Good tip, much better than mine: since you have to sort the data to remove duplicates anyway, you might as well add the key change property when you sort.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can use Remove Duplicates and yet capture the duplicates. Upstream of the Remove Duplicates stage insert a Copy stage. On a second output of the Copy stage insert an Aggregator stage that counts the rows, grouping by the same columns that the Remove Duplicates stage uses, then filter those rows for which the count is 2 or more. These are the duplicates. More efficiently, ignore the Remove Duplicates stage completely, and have a second output from the Filter stage passing those rows for which the count is precisely 1.
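The Copy/Aggregator/Filter pattern amounts to counting rows per key and splitting on the count. A Python sketch of that idea (names are my own, not DataStage syntax; note this splits whole key groups, so a "duplicate" group includes every row sharing the key):

```python
from collections import Counter

def split_by_count(rows, key_cols):
    """Split rows into (uniques, duplicates) by per-key row count.

    The Counter plays the Aggregator's role (count grouped by key);
    the two list comprehensions play the Filter stage's two outputs.
    """
    key_of = lambda r: tuple(r[c] for c in key_cols)
    counts = Counter(key_of(r) for r in rows)          # Aggregator: group and count
    uniques = [r for r in rows if counts[key_of(r)] == 1]  # count precisely 1
    dups = [r for r in rows if counts[key_of(r)] >= 2]     # count 2 or more
    return uniques, dups
```

Unlike the sorted-input techniques above, this does not require the data to be sorted first, though it does hold the counts for every key in memory.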
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.