Separate all duplicate records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rajkraj
Premium Member
Posts: 98
Joined: Wed Jun 15, 2005 1:41 pm

Separate all duplicate records

Post by rajkraj »

I know how to remove duplicates or separate out the duplicate rows, but
my requirement is that I need to process all duplicate rows separately.

Ex:
Input data:

123 AA
124 BB
124 BB
125 CC
126 DD
126 DD
126 DD

Output in two datasets:

1st dataset should only have non-duplicate rows:
123 AA
125 CC

2nd dataset should contain all duplicate rows:
124 BB
124 BB
126 DD
126 DD
126 DD

Thanks,
Raj
ratikmishra1
Charter Member
Posts: 42
Joined: Wed Aug 18, 2004 2:49 pm

Post by ratikmishra1 »

A Join with one input from the file and the other via an Aggregator stage, followed by a Filter stage, would work in this case.
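As a plain-Python sketch of that suggested design (an illustration of the intent, not DataStage code, using the sample data from the original post): the Aggregator computes a per-key count, the Join attaches that count back to every row, and the Filter routes each row on the count.

```python
from collections import Counter

# Sample input from the original post: (key, value) rows.
rows = [("123", "AA"), ("124", "BB"), ("124", "BB"),
        ("125", "CC"), ("126", "DD"), ("126", "DD"), ("126", "DD")]

# Aggregator: count of rows per key.
counts = Counter(key for key, _ in rows)

# Join + Filter: route each row on its key's count.
non_duplicates = [r for r in rows if counts[r[0]] == 1]  # count = 1
duplicates = [r for r in rows if counts[r[0]] > 1]       # count > 1

print(non_duplicates)  # [('123', 'AA'), ('125', 'CC')]
print(duplicates)      # all 124 BB and 126 DD rows
```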
rajkraj
Premium Member
Posts: 98
Joined: Wed Jun 15, 2005 1:41 pm

Post by rajkraj »

Thanks for the reply, can you explain it in more detail?

Does anyone have any other methods to achieve this?
ratikmishra1 wrote:A Join with one input from the file and the other via an Aggregator stage, followed by a Filter stage, would work in this case.
neena
Participant
Posts: 90
Joined: Mon Mar 31, 2003 4:32 pm

Post by neena »

This is the test job that I did for your question.

After the source stage, take these stage variables in a Transformer:

Vari1: Key1 : Key2
Vari2: If Vari1 = Vari3 Then 1 Else 0
Vari3: Key1 : Key2
Flag: If Vari2 = 0 Then 1 Else Flag + 1

Then create 3 output links. In the first link put a constraint Flag = 2 and get the keys.
In the second link put the constraint Flag = 1 and get the keys.
In the third link just get the keys; don't use any constraint.

Join links 1 & 2 with a Join stage and do a full outer join; in my example I took link 1 as the left.
After this stage put a Transformer and take 2 output links, each with a constraint that all your right-side key values should be '' or zeros, as per your datatype.
One link should go to a target file which is going to have only the non-duplicate data values.
The other link, which has the same constraint, should be used to join to the third link that was created in Transformer one. In this Join stage give the key values. If the right link is link 3 from Transformer one, then after the Join stage (use a full outer join) use a Transformer stage. In this Transformer's output link use the constraint that all your left-side keys should be '' or zeros, as per your datatype. Then you get only the duplicate records.

The design will be like this:

              lnk1
Source ----> Transformer1 ------------------> JN ----> Xfrmer ----> Target (non-dupes)
                  | |               lnk2      ^           |
                  | +-------------------------+           |
                  | lnk3                                  v
                  +-----------------------------------> JN ----> Xfmr ----> Target (dupes)

Please make sure all your partitioning is correct. Thanks.
neena
Participant
Posts: 90
Joined: Mon Mar 31, 2003 4:32 pm

Post by neena »

Sorry, the design is not correctly posted.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Edit your post. Wrap it (the design) in Code tags then use Preview until it IS right.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Code: Select all

----> Sort ----> Copy -----------> Join --------> Copy  --->  Filter1 --->
                   |                 ^             |
                   |                 |             |
                   |                 |             V
                   +->  Aggregator --+           Filter2  ---->
Sort stage generates key change column.
Filter1 isolates those rows for which key change column is 1.
Filter2 isolates those rows for which key change column is not 1.
Aggregator generates count of each key value (omit if not required).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
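A rough Python analogue of the design above (an illustration only, not DataStage code; the sample rows are from the original post): sorting and grouping on the key plays the role of the Sort stage's key change column plus the Aggregator's count, and the two filters split on whether a key's group holds one row or several.

```python
from itertools import groupby

# Unordered sample input: (key, value) rows.
rows = [("124", "BB"), ("126", "DD"), ("123", "AA"),
        ("124", "BB"), ("126", "DD"), ("125", "CC"), ("126", "DD")]

# Sort stage: order rows by key so each key's rows are adjacent.
rows.sort(key=lambda r: r[0])

singles, dupes = [], []
for key, group in groupby(rows, key=lambda r: r[0]):
    group = list(group)        # Aggregator: len(group) is this key's count
    if len(group) == 1:
        singles.extend(group)  # Filter1: keys occurring exactly once
    else:
        dupes.extend(group)    # Filter2: keys occurring more than once

print(singles)  # [('123', 'AA'), ('125', 'CC')]
print(dupes)    # all 124 BB and 126 DD rows
```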
rajkraj
Premium Member
Posts: 98
Joined: Wed Jun 15, 2005 1:41 pm

Post by rajkraj »

Thank you all.
vilasini
Participant
Posts: 32
Joined: Sun Dec 09, 2007 11:48 pm

Post by vilasini »

Please describe how you resolved it. It will help others.
mobashshar
Participant
Posts: 91
Joined: Wed Apr 20, 2005 7:59 pm
Location: U.S.

Post by mobashshar »

Very easy solution:
Step 1: Use a Sort stage to sort the data on the key column, and make sure to set Create Key Change Column = True.
Step 2: Use a Transformer; in the first output link's constraint say InputColumn.keyChange = 1, and in the second output link's constraint say InputColumn.keyChange = 0.

Now you will have two outputs. The first one will have all the non-duplicate rows (where keyChange = 1) and the other one will have all the duplicate rows (where keyChange = 0).

Makes sense?
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

You need to check how this works across multiple partitions and multiple nodes.

Regards
Sreeni
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

mobashshar wrote:Very easy solution:
Step 1: Use a Sort stage to sort the data on the key column, and make sure to set Create Key Change Column = True.
Step 2: Use a Transformer; in the first output link's constraint say InputColumn.keyChange = 1, and in the second output link's constraint say InputColumn.keyChange = 0.

Now you will have two outputs. The first one will have all the non-duplicate rows (where keyChange = 1) and the other one will have all the duplicate rows (where keyChange = 0).

Makes sense?
Except rajkraj's example shows he is looking for uniques vs. non-uniques (which is why the Aggregator and a filter on count > 1 is the way to go), not the cardinal set vs. dupes (which a key change column will give you).
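The distinction can be shown with a small Python sketch (an illustration only, using the sample data from the original post): a per-key count separates true uniques from all duplicate rows, while a key change flag on sorted data only separates first occurrences from their repeats.

```python
from collections import Counter

rows = [("123", "AA"), ("124", "BB"), ("124", "BB"),
        ("125", "CC"), ("126", "DD"), ("126", "DD"), ("126", "DD")]

# Split 1: per-key count (the requested result).
counts = Counter(key for key, _ in rows)
uniques = [r for r in rows if counts[r[0]] == 1]    # only 123 and 125
all_dupes = [r for r in rows if counts[r[0]] > 1]   # every 124 and 126 row

# Split 2: key change flag on sorted input (NOT the requested result).
seen = set()
first_occurrences, repeats = [], []  # keyChange = 1 / keyChange = 0
for r in rows:
    (repeats if r[0] in seen else first_occurrences).append(r)
    seen.add(r[0])
# first_occurrences still contains one 124 row and one 126 row,
# so it is the deduplicated set, not the set of unique keys.
```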