How to Remove Duplicates?

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

Post Reply
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

How to Remove Duplicates?

Post by Minhajuddin »

Hi,

I have a scenario where I need to delete all the duplicates from the source file.

Say I have this i/p data -->

EmpID
====
1
1
2
3
4
4

O/P Data-->

EmpID
====
2
3

Here since 1 and 4 are repeating I need to remove all the rows which have 1 or 4 as the EmpID.

I am implementing it this way -->
Here I have an Aggregator which counts the EmpIDs and then do a join on the key column EmpID.
In the next transformer I pass the rows only if count=1.
Image


Is there a better and more elegant way to do it?
Appreciate all your time and help.

Thanks a lot.[/img]
Minhajuddin

<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
Maveric
Participant
Posts: 388
Joined: Tue Mar 13, 2007 1:28 am

Post by Maveric »

You can use a Aggregation and filter stage to do this. Seq File ----> Agg -----> Filter. Hash partition on ID in Aggregator stage. Set the property to row count and key as ID. In Filter Stage give the where condition as row count = 1. Only if the row count is one carry the record forward.
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Thank you Maverick.
But I forgot to mention that I had more than one column(Almost 10 columns). And I need to check the duplicates on a single column. Just for the sake of simplicity I posted it using a single column.

Thanks for all the help.
Minhajuddin

<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
Maveric
Participant
Posts: 388
Joined: Tue Mar 13, 2007 1:28 am

Post by Maveric »

One thing i am not sure is if u can propagate all the input fields to the output in the aggregator stage. If you can propagate all the input fields then the logic above should work. If you can propagate only the key fields and the output field then you need to join it to main data and then use filter stage.
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Isn't there a special stage in Datastage which accomplishes this?
Minhajuddin

<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Surely you don't mean the Remove Duplicates stage?
-craig

"You can never have too many knives" -- Logan Nine Fingers
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

I doubt if we can accomplish this with Remove Duplicate stages.
Minhajuddin

<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:? Why would that be? You asked "How to Remove Duplicates" and "I have a scenario where I need to delete all the duplicates from the source file". I'm curious how exactly this stage fails to work for you. Have you tried it?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I think Maveric suggested a good solution. The question shouldn't have been "how to remove duplicates", since the goal is to remove duplicate rows as well as the original row if it has duplicates, which is a more complex operation.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ah... that was the key piece I was missing. Thanks Arnd.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Hi Josh,
That sounds like a good idea. I will try it today and let you know.

And thanks everybody for all your help.
Minhajuddin

<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This might seem a bleedingly obvious question at this point, but why can't you use the Remove Duplicates stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Ray - because the poster also wants to remove the original row as well if duplicates exist.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

OK, count them (Aggregator) and run them through a Filter stage passing only those with count = 1.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

Thank you Josh,

It's working fine! :D
Thank you all for all you valuable suggestions.
Minhajuddin

<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
Post Reply