Page 1 of 1

How to Remove Duplicates?

Posted: Tue Jun 05, 2007 10:29 am
by Minhajuddin
Hi,

I have a scenario where I need to delete all the duplicates from the source file.

Say I have this i/p data -->

EmpID
====
1
1
2
3
4
4

O/P Data-->

EmpID
====
2
3

Here since 1 and 4 are repeating I need to remove all the rows which have 1 or 4 as the EmpID.

I am implementing it this way -->
Here I have an Aggregator which counts the EmpIDs and then do a join on the key column EmpID.
In the next transformer I pass the rows only if count=1.
Image


Is there a better and more elegant way to do it?
Appreciate all your time and help.

Thanks a lot.[/img]

Posted: Tue Jun 05, 2007 10:55 am
by Maveric
You can use a Aggregation and filter stage to do this. Seq File ----> Agg -----> Filter. Hash partition on ID in Aggregator stage. Set the property to row count and key as ID. In Filter Stage give the where condition as row count = 1. Only if the row count is one carry the record forward.

Posted: Tue Jun 05, 2007 11:00 am
by Minhajuddin
Thank you Maverick.
But I forgot to mention that I had more than one column(Almost 10 columns). And I need to check the duplicates on a single column. Just for the sake of simplicity I posted it using a single column.

Thanks for all the help.

Posted: Tue Jun 05, 2007 11:24 am
by Maveric
One thing i am not sure is if u can propagate all the input fields to the output in the aggregator stage. If you can propagate all the input fields then the logic above should work. If you can propagate only the key fields and the output field then you need to join it to main data and then use filter stage.

Posted: Tue Jun 05, 2007 11:47 am
by Minhajuddin
Isn't there a special stage in Datastage which accomplishes this?

Posted: Tue Jun 05, 2007 2:05 pm
by chulett
Surely you don't mean the Remove Duplicates stage?

Posted: Tue Jun 05, 2007 2:28 pm
by Minhajuddin
I doubt if we can accomplish this with Remove Duplicate stages.

Posted: Tue Jun 05, 2007 2:58 pm
by chulett
:? Why would that be? You asked "How to Remove Duplicates" and "I have a scenario where I need to delete all the duplicates from the source file". I'm curious how exactly this stage fails to work for you. Have you tried it?

Posted: Tue Jun 05, 2007 3:25 pm
by ArndW
I think Maveric suggested a good solution. The question shouldn't have been "how to remove duplicates", since the goal is to remove duplicate rows as well as the original row if it has duplicates, which is a more complex operation.

Posted: Tue Jun 05, 2007 3:31 pm
by chulett
Ah... that was the key piece I was missing. Thanks Arnd.

Posted: Tue Jun 05, 2007 11:37 pm
by Minhajuddin
Hi Josh,
That sounds like a good idea. I will try it today and let you know.

And thanks everybody for all your help.

Posted: Wed Jun 06, 2007 12:30 am
by ray.wurlod
This might seem a bleedingly obvious question at this point, but why can't you use the Remove Duplicates stage?

Posted: Wed Jun 06, 2007 12:57 am
by ArndW
Ray - because the poster also wants to remove the original row as well if duplicates exist.

Posted: Wed Jun 06, 2007 1:37 am
by ray.wurlod
OK, count them (Aggregator) and run them through a Filter stage passing only those with count = 1.

Posted: Wed Jun 06, 2007 3:34 am
by Minhajuddin
Thank you Josh,

It's working fine! :D
Thank you all for all you valuable suggestions.