Separate all duplicate records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rajkraj
Premium Member
Posts: 98
Joined: Wed Jun 15, 2005 1:41 pm

Separate all duplicate records

Post by rajkraj »

I know how to remove duplicates or separate out the duplicate rows, but
my requirement is that I need to process all duplicate rows separately.

Ex:
Input data:

123 AA
124 BB
124 BB
125 CC
126 DD
126 DD
126 DD

Output in two datasets:

1st dataset should only have non-duplicate rows:
123 AA
125 CC

2nd dataset should contain all duplicate rows:
124 BB
124 BB
126 DD
126 DD
126 DD

Thanks,
Raj
ratikmishra1
Charter Member
Posts: 42
Joined: Wed Aug 18, 2004 2:49 pm

Post by ratikmishra1 »

A Join with one input from the file and the other via an Aggregator stage, followed by a Filter stage, would work in this case.
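As a plain-Python sketch of that suggested design (an illustration of the intent, not DataStage code, using the sample data from the original post): the Aggregator computes a per-key count, the Join attaches that count back to every row, and the Filter routes each row on the count.

```python
from collections import Counter

# Sample input from the original post: (key, value) rows.
rows = [("123", "AA"), ("124", "BB"), ("124", "BB"),
        ("125", "CC"), ("126", "DD"), ("126", "DD"), ("126", "DD")]

# Aggregator: count of rows per key.
counts = Counter(key for key, _ in rows)

# Join + Filter: route each row on its key's count.
non_duplicates = [r for r in rows if counts[r[0]] == 1]  # count = 1
duplicates = [r for r in rows if counts[r[0]] > 1]       # count > 1

print(non_duplicates)  # [('123', 'AA'), ('125', 'CC')]
print(duplicates)      # all 124 BB and 126 DD rows
```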
rajkraj
Premium Member
Posts: 98
Joined: Wed Jun 15, 2005 1:41 pm

Post by rajkraj »

Thanks for the reply, can you explain it in more detail?

Does anyone have any other methods to achieve this?
ratikmishra1 wrote:A Join with one input from the file and the other via an Aggregator stage, followed by a Filter stage, would work in this case.
neena
Participant
Posts: 90
Joined: Mon Mar 31, 2003 4:32 pm

Post by neena »

This is the test job that I did for your question.

After the source stage, take these stage variables in a Transformer:

Vari1: Key1 : Key2
Vari2: If Vari1 = Vari3 Then 1 Else 0
Vari3: Key1 : Key2
Flag: If Vari2 = 0 Then 1 Else Flag + 1

Then create 3 output links. In the first link put a constraint Flag = 2 and get the keys.
In the second link put the constraint Flag = 1 and get the keys.
In the third link just get the keys; don't use any constraint.

Join links 1 & 2 with a Join stage and do a full outer join; in my example I took link 1 as the left.
After this stage put a Transformer and take 2 output links, each with a constraint that all your right-side key values should be '' or zeros, as per your datatype.
One link should go to a target file which is going to have only the non-duplicate data values.
The other link, which has the same constraint, should be used to join to the third link that was created in Transformer one. In this Join stage give the key values. If the right link is link 3 from Transformer one, then after the Join stage (use a full outer join) use a Transformer stage. In this Transformer's output link use the constraint that all your left-side keys should be '' or zeros, as per your datatype. Then you get only the duplicate records.

The design will be like this:

              lnk1
Source ----> Transformer1 ------------------> JN ----> Xfrmer ----> Target (non-dupes)
                  | |               lnk2      ^           |
                  | +-------------------------+           |
                  | lnk3                                  v
                  +-----------------------------------> JN ----> Xfmr ----> Target (dupes)

Please make sure all your partitioning is correct. Thanks.
neena
Participant
Posts: 90
Joined: Mon Mar 31, 2003 4:32 pm

Post by neena »

Sorry, the design is not correctly posted.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Edit your post. Wrap it (the design) in Code tags then use Preview until it IS right.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Code: Select all

----> Sort ----> Copy -----------> Join --------> Copy  --->  Filter1 --->
                   |                 ^             |
                   |                 |             |
                   |                 |             V
                   +->  Aggregator --+           Filter2  ---->
Sort stage generates key change column.
Filter1 isolates those rows for which key change column is 1.
Filter2 isolates those rows for which key change column is not 1.
Aggregator generates count of each key value (omit if not required).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
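A rough Python analogue of the design above (an illustration only, not DataStage code; the sample rows are from the original post): sorting and grouping on the key plays the role of the Sort stage's key change column plus the Aggregator's count, and the two filters split on whether a key's group holds one row or several.

```python
from itertools import groupby

# Unordered sample input: (key, value) rows.
rows = [("124", "BB"), ("126", "DD"), ("123", "AA"),
        ("124", "BB"), ("126", "DD"), ("125", "CC"), ("126", "DD")]

# Sort stage: order rows by key so each key's rows are adjacent.
rows.sort(key=lambda r: r[0])

singles, dupes = [], []
for key, group in groupby(rows, key=lambda r: r[0]):
    group = list(group)        # Aggregator: len(group) is this key's count
    if len(group) == 1:
        singles.extend(group)  # Filter1: keys occurring exactly once
    else:
        dupes.extend(group)    # Filter2: keys occurring more than once

print(singles)  # [('123', 'AA'), ('125', 'CC')]
print(dupes)    # all 124 BB and 126 DD rows
```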
rajkraj
Premium Member
Posts: 98
Joined: Wed Jun 15, 2005 1:41 pm

Post by rajkraj »

Thank you all.
vilasini
Participant
Posts: 32
Joined: Sun Dec 09, 2007 11:48 pm

Post by vilasini »

Please describe how you resolved it. It will help others.
mobashshar
Participant
Posts: 91
Joined: Wed Apr 20, 2005 7:59 pm
Location: U.S.

Post by mobashshar »

Very easy solution:
Step 1: Use a Sort stage to sort the data on the key column, and make sure to set Create Key Change Column = True.
Step 2: Use a Transformer; in the first output link's constraint say InputColumn.keyChange = 1, and in the second output link's constraint say InputColumn.keyChange = 0.

Now you will have two outputs. The first one will have all the non-duplicate rows (where keyChange = 1) and the other one will have all the duplicate rows (where keyChange = 0).

Makes sense?
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

You need to check how this works across multiple partitions and multiple nodes.

Regards
Sreeni
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

mobashshar wrote:Very easy solution:
Step 1: Use a Sort stage to sort the data on the key column, and make sure to set Create Key Change Column = True.
Step 2: Use a Transformer; in the first output link's constraint say InputColumn.keyChange = 1, and in the second output link's constraint say InputColumn.keyChange = 0.

Now you will have two outputs. The first one will have all the non-duplicate rows (where keyChange = 1) and the other one will have all the duplicate rows (where keyChange = 0).

Makes sense?
Except rajkraj's example shows he is looking for uniques vs. non-uniques (which is why the Aggregator and a filter on count > 1 is the way to go), not the cardinal set vs. dupes (which a key change column will give you).
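The distinction can be shown with a small Python sketch (an illustration only, using the sample data from the original post): a per-key count separates true uniques from all duplicate rows, while a key change flag on sorted data only separates first occurrences from their repeats.

```python
from collections import Counter

rows = [("123", "AA"), ("124", "BB"), ("124", "BB"),
        ("125", "CC"), ("126", "DD"), ("126", "DD"), ("126", "DD")]

# Split 1: per-key count (the requested result).
counts = Counter(key for key, _ in rows)
uniques = [r for r in rows if counts[r[0]] == 1]    # only 123 and 125
all_dupes = [r for r in rows if counts[r[0]] > 1]   # every 124 and 126 row

# Split 2: key change flag on sorted input (NOT the requested result).
seen = set()
first_occurrences, repeats = [], []  # keyChange = 1 / keyChange = 0
for r in rows:
    (repeats if r[0] in seen else first_occurrences).append(r)
    seen.add(r[0])
# first_occurrences still contains one 124 row and one 126 row,
# so it is the deduplicated set, not the set of unique keys.
```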