How to Remove Duplicates?

Minhajuddin · Post by **Minhajuddin** » Tue Jun 05, 2007 10:29 am

Hi,

I have a scenario where I need to delete all the duplicates from the source file.

Say I have this i/p data -->

EmpID
====
1
1
2
3
4
4

O/P Data-->

EmpID
====
2
3

Here since 1 and 4 are repeating I need to remove all the rows which have 1 or 4 as the EmpID.

I am implementing it this way -->
Here I have an Aggregator which counts the EmpIDs and then do a join on the key column EmpID.
In the next transformer I pass the rows only if count=1.

Is there a better and more elegant way to do it?
Appreciate all your time and help.

Thanks a lot.[/img]

Maveric · Post by **Maveric** » Tue Jun 05, 2007 10:55 am

You can use a Aggregation and filter stage to do this. Seq File ----> Agg -----> Filter. Hash partition on ID in Aggregator stage. Set the property to row count and key as ID. In Filter Stage give the where condition as row count = 1. Only if the row count is one carry the record forward.

Minhajuddin · Post by **Minhajuddin** » Tue Jun 05, 2007 11:00 am

Thank you Maverick.
But I forgot to mention that I had more than one column(Almost 10 columns). And I need to check the duplicates on a single column. Just for the sake of simplicity I posted it using a single column.

Thanks for all the help.

Maveric · Post by **Maveric** » Tue Jun 05, 2007 11:24 am

One thing i am not sure is if u can propagate all the input fields to the output in the aggregator stage. If you can propagate all the input fields then the logic above should work. If you can propagate only the key fields and the output field then you need to join it to main data and then use filter stage.

Minhajuddin · Post by **Minhajuddin** » Tue Jun 05, 2007 11:47 am

Isn't there a special stage in Datastage which accomplishes this?

chulett · Post by **chulett** » Tue Jun 05, 2007 2:05 pm

Surely you don't mean the Remove Duplicates stage?

Minhajuddin · Post by **Minhajuddin** » Tue Jun 05, 2007 2:28 pm

I doubt if we can accomplish this with Remove Duplicate stages.

chulett · Post by **chulett** » Tue Jun 05, 2007 2:58 pm

Why would that be? You asked "How to Remove Duplicates" and "I have a scenario where I need to delete all the duplicates from the source file". I'm curious how exactly this stage fails to work for you. Have you tried it?

ArndW · Post by **ArndW** » Tue Jun 05, 2007 3:25 pm

I think Maveric suggested a good solution. The question shouldn't have been "how to remove duplicates", since the goal is to remove duplicate rows as well as the original row if it has duplicates, which is a more complex operation.

chulett · Post by **chulett** » Tue Jun 05, 2007 3:31 pm

Ah... that was the key piece I was missing. Thanks Arnd.

Minhajuddin · Post by **Minhajuddin** » Tue Jun 05, 2007 11:37 pm

Hi Josh,
That sounds like a good idea. I will try it today and let you know.

And thanks everybody for all your help.

ray.wurlod · Post by **ray.wurlod** » Wed Jun 06, 2007 12:30 am

This might seem a bleedingly obvious question at this point, but why can't you use the Remove Duplicates stage?

ArndW · Post by **ArndW** » Wed Jun 06, 2007 12:57 am

Ray - because the poster also wants to remove the original row as well if duplicates exist.

ray.wurlod · Post by **ray.wurlod** » Wed Jun 06, 2007 1:37 am

OK, count them (Aggregator) and run them through a Filter stage passing only those with count = 1.

Minhajuddin · Post by **Minhajuddin** » Wed Jun 06, 2007 3:34 am

Thank you Josh,

It's working fine! :D
Thank you all for all you valuable suggestions.