Remove Duplicates stage compared to Sort stage

hiltsmi · Post by **hiltsmi** » Wed Nov 02, 2005 3:23 pm

I want to remove duplicate records from a data set. I was going to use the Remove Duplicates stage. The documentation says the data must be sorted before removing duplicates. However, the Sort stage can also remove duplicates.

Why would I want to use the Remove Duplicates stage when the Sort stage can do it?

What does the Remove Duplicates stage do that the Sort stage doesn't?

ray.wurlod · Post by **ray.wurlod** » Thu Nov 03, 2005 2:33 am

Most stages' input links also let you specify sorting and removal of duplicates, so you can get by without either the Remove Duplicates stage or the Sort stage. The choice is really about using the lowest-impact tool: if your data are already sorted, the Remove Duplicates stage probably has the least impact. The Sort stage lets you allot more memory to the sorting process.
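To illustrate why the documentation asks for sorted input, here is a rough sketch in plain Python (not DataStage code; `dedupe_adjacent` and the sample rows are made up for the example). It assumes duplicate removal works by comparing neighbouring records in a single pass, which only catches duplicates that sit next to each other.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_adjacent(records, key):
    """Keep the first record of each run of consecutive records sharing a key."""
    return [next(group) for _, group in groupby(records, key=key)]


# Hypothetical sample data: (key, payload) pairs.
rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
by_key = itemgetter(0)

# Unsorted input: equal keys are never adjacent, so nothing is removed.
unsorted_result = dedupe_adjacent(rows, by_key)

# Sorted input: equal keys are adjacent, so one pass removes the duplicates.
sorted_result = dedupe_adjacent(sorted(rows, key=by_key), by_key)
```

If the data already arrive sorted, the `sorted(...)` call is the part you can skip, which is the "least impact" case described above.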

gsherry1 · Post by **gsherry1** » Thu Nov 03, 2005 2:27 pm

Hello MHilts,

Other Differences:

1. Remove Duplicates makes it much more visually obvious that you're deduping, which is more consistent with GUI-based development.

2. The Sort stage at best lets you retain the first record from each group, but Remove Duplicates lets you choose either the first or the last.
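The first/last difference in point 2 can be sketched in plain Python. This is only an illustration of the idea, not how either stage is implemented; the function name `remove_duplicates`, its `retain` parameter, and the sample rows are assumptions made for the example.

```python
from itertools import groupby


def remove_duplicates(sorted_records, key, retain="first"):
    """Return one record per key group from input already sorted on that key."""
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    result = []
    for _, group in groupby(sorted_records, key=key):
        members = list(group)
        # Pick the first or the last record of the group, in input order.
        result.append(members[0] if retain == "first" else members[-1])
    return result


# Hypothetical sample data, already sorted on the key (first field).
rows = [("a", 2), ("a", 4), ("b", 1), ("b", 3)]

first_kept = remove_duplicates(rows, key=lambda r: r[0], retain="first")
last_kept = remove_duplicates(rows, key=lambda r: r[0], retain="last")
```

Which record counts as "first" or "last" depends on the order within each key group, so a secondary sort column (a timestamp, say) is what makes "keep the latest" meaningful.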