Hi,
I want to filter the Duplicate records and also capture them.I can use a Remove Duplicate Stage by which I can remove duplicates but cannot capture the Duplicate Records. Can I use a Transformer for this.If so, Can anyone tell me how can I implement the Logic......
Purush
Using Transformer to Capture Duplicate Records
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
A transformer can use stage variables to remember values from the previous row. You do this by using one stage variable to compare and several to hold the comparison values. If you remember that Stage Variables are executed in the order they appear this makes it easy to compare adjacent rows of data.
In this example the first stage variable checks to see if the key from the previous row is different to the new key:
IsDuplicate: input.key1 : '|' : input.key2 = CompareKey
CompareKey: input.key1 : '|' : input.key2
This only works if you sort your data and partition it by that sort key. It is very fast. When IsDuplicate is true you output the row as a matched duplicate, otherwise you output it as unique.
In this example the first stage variable checks to see if the key from the previous row is different to the new key:
IsDuplicate: input.key1 : '|' : input.key2 = CompareKey
CompareKey: input.key1 : '|' : input.key2
This only works if you sort your data and partition it by that sort key. It is very fast. When IsDuplicate is true you output the row as a matched duplicate, otherwise you output it as unique.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
You can also perform this activity using SORT stage. Add a SORT stage to you job design. and sort it based on your key. Add SORT stage property Create Key Change Column. This property will add a column to your output link. The value of this column is 1 for the first record in the sort record group else the value is 0. Downstream in a transformer you can split the records based on the value of this column.
Hope it helps........
Hope it helps........
Last edited by csrazdan on Thu Apr 26, 2007 2:44 pm, edited 1 time in total.
Assume everything I say or do is positive
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
Good tip, much better than mine since you have to sort the data to do a remove duplicates you might as well add the key change property when you sort.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can use Remove Duplicates and yet capture the duplicates. Upstream of the Remove Duplicate stage insert a Copy stage. On a second output of the copy stage insert an Aggregator stage that counts the rows, grouping by the same columns that the Remove Duplicates stage uses, then filter those rows for which the count is 2 or more. These are the duplicates. More efficiently, ignore the Remove Duplicates stage completely, and have a second output from the Filter stage filtering those rows for which the count is precisely 1.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.