Separte all duplicate records
Moderators: chulett, rschirm, roy
Separte all duplicate records
I know how to remove duplicates or separate the duplicate rows but
My requirement is need process all duplicate rows separately
Ex:
Input data:
123 AA
124 BB
124 BB
125 CC
126 DD
126 DD
126 DD
Out put in two datasets:
1st dataset should only have non duplicate rows:
123 AA
125 CC
2ND dataset should contain all duplicate rows.
124 BB
124 BB
126 DD
126 DD
126 DD
Thanks,
Raj
My requirement is need process all duplicate rows separately
Ex:
Input data:
123 AA
124 BB
124 BB
125 CC
126 DD
126 DD
126 DD
Out put in two datasets:
1st dataset should only have non duplicate rows:
123 AA
125 CC
2ND dataset should contain all duplicate rows.
124 BB
124 BB
126 DD
126 DD
126 DD
Thanks,
Raj
-
- Charter Member
- Posts: 42
- Joined: Wed Aug 18, 2004 2:49 pm
This is the test job that i did for your question.
after the source stage in Transformer take 3 stage variables.
Vari1: key1:Key2
Vari2: If Vari1 = Vari3 then 1 Else 0
Vari3: Key1 : Key2
Flag : If Var2=0 then 1 Else Flag+1
Then Create 3 output Links in First link Put a constraint Flag=2 and get the keys.
In the Second Link Put the Constrain Flag=1 and get the keys
In third link just get the keys don't use any constraint.
Join Link 1 & 2 with Join Stage and do a full outer Join, In my example i took Link 1 as Left.
After this stage put a transformer and take 2 output links which should have constraint that all your right key values should be '' OR Zeros as per your Datatype.
one link should be for a target file with is gone have only Non-Duplicate data values.
Other Link which also has the same constraint should be using to join to the third link that was created in Transformer one. In this Join stage give the key values, If the right link is your link 3 which you got from Transformer one, then after the join stage(use full outer) use the transformer stage. In this transformer output link use the constraint that all your left keys should be '' or Zeros as per your datatype. Then you get only the duplicate records.
the design will be like this
lnk1
Source File----> Transformer1 -------->
lnk 2 JN -------> Xfrmer---->Target
---------> |
|
lnk3 |
-----------------------------> JN---->Xfmr-->Target
Pleae make sure all your Partitioning is correct. Thanks.
after the source stage in Transformer take 3 stage variables.
Vari1: key1:Key2
Vari2: If Vari1 = Vari3 then 1 Else 0
Vari3: Key1 : Key2
Flag : If Var2=0 then 1 Else Flag+1
Then Create 3 output Links in First link Put a constraint Flag=2 and get the keys.
In the Second Link Put the Constrain Flag=1 and get the keys
In third link just get the keys don't use any constraint.
Join Link 1 & 2 with Join Stage and do a full outer Join, In my example i took Link 1 as Left.
After this stage put a transformer and take 2 output links which should have constraint that all your right key values should be '' OR Zeros as per your Datatype.
one link should be for a target file with is gone have only Non-Duplicate data values.
Other Link which also has the same constraint should be using to join to the third link that was created in Transformer one. In this Join stage give the key values, If the right link is your link 3 which you got from Transformer one, then after the join stage(use full outer) use the transformer stage. In this transformer output link use the constraint that all your left keys should be '' or Zeros as per your datatype. Then you get only the duplicate records.
the design will be like this
lnk1
Source File----> Transformer1 -------->
lnk 2 JN -------> Xfrmer---->Target
---------> |
|
lnk3 |
-----------------------------> JN---->Xfmr-->Target
Pleae make sure all your Partitioning is correct. Thanks.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Code: Select all
----> Sort ----> Copy -----------> Join --------> Copy ---> Filter1 --->
| ^ |
| | |
| | V
+-> Aggregator --+ Filter2 ---->
Filter1 isolates those rows for which key change column is 1.
Filter 2 isolates those rows for which key change column is not 1.
Aggregator generates count of each key value (omit if not required).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 91
- Joined: Wed Apr 20, 2005 7:59 pm
- Location: U.S.
Very easy solution:
Step1: Use SORT stage to sort data on key column and make sure to use Create Key Change Column = True.
Step2: Use a Transformer and in first output link Constraint say InputColumn.keyChange=1 and in second output link Constraint say InputColumn.keyChange=0.
Now you will have Two Output. The first one will have all the non duplicate rows (where keyChange=1) and the other one will have all the duplicate rows (where keyChange=0)
makes sense?
Step1: Use SORT stage to sort data on key column and make sure to use Create Key Change Column = True.
Step2: Use a Transformer and in first output link Constraint say InputColumn.keyChange=1 and in second output link Constraint say InputColumn.keyChange=0.
Now you will have Two Output. The first one will have all the non duplicate rows (where keyChange=1) and the other one will have all the duplicate rows (where keyChange=0)
makes sense?
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
Except Rajkraj's example shows he is looking for uniques vs non-uniques (why the aggregate & filter by count>1 is the way to go), not cardinal set vs dupes (which a key change column will give you).mobashshar wrote:Very easy solution:
Step1: Use SORT stage to sort data on key column and make sure to use Create Key Change Column = True.
Step2: Use a Transformer and in first output link Constraint say InputColumn.keyChange=1 and in second output link Constraint say InputColumn.keyChange=0.
Now you will have Two Output. The first one will have all the non duplicate rows (where keyChange=1) and the other one will have all the duplicate rows (where keyChange=0)
makes sense?