remove duplicate stage compared to sort stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

hiltsmi
Participant
Posts: 20
Joined: Thu Aug 04, 2005 9:03 am

remove duplicate stage compared to sort stage

Post by hiltsmi »

I want to remove duplicate records from a data set. I was going to use the Remove Duplicates stage. The documentation says the data must be sorted before duplicates are removed. However, the Sort stage can also remove duplicates.

Why would I want to use the Remove Duplicates stage when the Sort stage can do it?

What does the Remove Duplicates stage do that the Sort stage doesn't?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Most stages' input links also allow you to specify sorting and removal of duplicates, so you can get by without either the Remove Duplicates stage or the Sort stage. So the choice is really about using the lowest-impact tool: if your data are already sorted, then the Remove Duplicates stage probably has the least impact. The Sort stage gives you the ability to allot more memory to the sorting process.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gsherry1
Charter Member
Posts: 173
Joined: Fri Jun 17, 2005 8:31 am
Location: Canada

Post by gsherry1 »

Hello MHilts,

Other Differences:

1. The Remove Duplicates stage makes it much more visible in the job design that you're deduplicating, which is more consistent with GUI-based development.

2. The Sort stage at best lets you retain the first record from each group, whereas the Remove Duplicates stage lets you choose either the first or the last.
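
To make the first/last distinction (and the "data must be sorted" requirement from the original question) concrete, here is a rough sketch in plain Python. It is only an analogy, not DataStage code: the function `remove_duplicates`, its `retain` parameter, and the `cust_id`/`updated` columns are all made up for illustration. It assumes duplicates are detected by comparing adjacent records, which is why the input needs to be sorted on the key first.

```python
from itertools import groupby
from operator import itemgetter


def remove_duplicates(rows, key, retain="first"):
    """Keep one row per run of adjacent rows that share the same key.

    Hypothetical helper for illustration only. It compares adjacent
    rows, so the input must already be sorted (or at least grouped)
    on the key for every duplicate to be caught.
    """
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    result = []
    for _, group in groupby(rows, key=key):
        group = list(group)
        result.append(group[0] if retain == "first" else group[-1])
    return result


# Made-up sample data: two customers, two records each.
rows = [
    {"cust_id": 2, "updated": "2005-03-01"},
    {"cust_id": 1, "updated": "2005-01-10"},
    {"cust_id": 2, "updated": "2005-01-05"},
    {"cust_id": 1, "updated": "2005-02-20"},
]

# Sort on the dedup key, then on a secondary column that decides
# which record counts as "first" and which as "last" in each group.
sorted_rows = sorted(rows, key=itemgetter("cust_id", "updated"))

first = remove_duplicates(sorted_rows, key=itemgetter("cust_id"), retain="first")
last = remove_duplicates(sorted_rows, key=itemgetter("cust_id"), retain="last")

# Skipping the sort leaves duplicates in place, because equal keys
# are never adjacent in the unsorted input.
unsorted_result = remove_duplicates(rows, key=itemgetter("cust_id"))
```

With the sorted input, `first` holds the oldest record per customer and `last` the newest; the unsorted call returns all four rows untouched. If only "retain first" is available, the same effect as "retain last" can usually be had by reversing the order of the secondary sort column.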