
Understanding of Remove Duplicate Stage Execution

Posted: Wed May 23, 2012 8:19 am
by ramsubbiah
Hi All,

I need a clarification about the Remove Duplicates stage.


Scenario 1:
Job Design:

Source Dataset ---> Sort stage ----> peek stage

As we all know, the Sort stage can remove duplicates. In this case, when I checked the $APT_DUMP_SCORE output of my job, I could not see a separate operator (remdup). So can I assume the tsort operator performs both the sort and the duplicate removal? Or is a remdup operator assigned internally to remove the duplicates?

Scenario 2:
Job Design:

Source Dataset ---> Sort stage ----> Remove Duplicates stage ----> peek stage

In this case I could see that separate operators were assigned to the Sort stage and to the Remove Duplicates stage.

Which approach is better in terms of performance? Thanks in advance.

Thanks,
Ram

Posted: Wed May 23, 2012 4:37 pm
by ray.wurlod
They're pretty close to identical in terms of performance. What's different is the functionality - with the Remove Duplicates stage you can specify which record to keep from each group (first or last) whereas with a unique sort you cannot specify which record to keep from each group.
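
To illustrate the first/last choice described above, here is a minimal Python sketch of the semantics only. It is not DataStage code and does not reflect how the engine's operators are implemented; the function name `remove_duplicates`, the `keep` parameter, and the sample records are all made up for illustration. It assumes the input is already sorted on the key, as it would be after a Sort stage.

```python
from itertools import groupby
from operator import itemgetter


def remove_duplicates(sorted_records, key, keep="first"):
    """Keep one record per key group from input already sorted on `key`.

    Hypothetical helper: mimics the "first or last" retention choice,
    not the actual operator implementation.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    result = []
    # groupby only merges adjacent equal keys, hence the sorted-input assumption.
    for _, group in groupby(sorted_records, key=itemgetter(key)):
        rows = list(group)
        result.append(rows[0] if keep == "first" else rows[-1])
    return result


# Illustrative sample data, sorted on "id".
records = [
    {"id": 1, "value": "a"},
    {"id": 1, "value": "b"},
    {"id": 2, "value": "c"},
]

first_kept = remove_duplicates(records, "id", keep="first")
last_kept = remove_duplicates(records, "id", keep="last")
```

With `keep="first"` the group for id 1 yields the record with value "a"; with `keep="last"` it yields the one with value "b". A unique sort gives you no such switch.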

Posted: Thu May 24, 2012 5:10 am
by ramsubbiah
Hi Ray,
Thanks for your response.
Since I don't have a premium membership, I am not able to see your complete response. Anyway, I will upgrade my membership and let you know the outcome.

Thanks,
Ram