Understanding of Remove Duplicate Stage Execution

ramsubbiah · Post by **ramsubbiah** » Wed May 23, 2012 8:19 am

Hi All,

I need some clarification about the Remove Duplicates stage.

Scenario 1:
Job Design:

Source Dataset ---> Sort stage ----> peek stage

As we all know, the Sort stage can remove duplicates. In this case, when I checked the $APT_DUMP_SCORE output of my job, I could not see a separate operator (the remdup operator). So can I assume that the tsort operator performs both the sort and the duplicate removal? Or is a remdup operator assigned internally to remove the duplicates?

Scenario 2:
Job Design:

Source Dataset ---> Sort stage ---->Remove Duplicate Stage----> peek stage

In this case I could see that separate operators were assigned to the Sort stage and to the Remove Duplicates stage.

Which approach is better in terms of performance? Thanks in advance.

Thanks,
Ram

ray.wurlod · Post by **ray.wurlod** » Wed May 23, 2012 4:37 pm

They're pretty close to identical in terms of performance. What's different is the functionality - with the Remove Duplicates stage you can specify which record to keep from each group (first or last) whereas with a unique sort you cannot specify which record to keep from each group.
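
To make the first/last distinction concrete, here is a minimal Python sketch. It is not DataStage code and does not reproduce how the tsort or remdup operators are implemented; the function `remove_duplicates`, its `keep` parameter, and the sample rows are all hypothetical and only model the behaviour described above on input that is already sorted on the key.

```python
from itertools import groupby
from operator import itemgetter


def remove_duplicates(records, key, keep="first"):
    """Keep one record per key group.

    Assumes `records` is already sorted on `key`, which mirrors placing
    a sort ahead of the duplicate removal. `keep` is a hypothetical
    stand-in for the "first or last" choice.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    result = []
    # groupby only merges adjacent equal keys, hence the sorted-input assumption.
    for _, group in groupby(records, key=itemgetter(key)):
        rows = list(group)
        result.append(rows[0] if keep == "first" else rows[-1])
    return result


# Illustrative data only: two ids, each with two timestamps.
rows = [
    {"id": 2, "ts": 5},
    {"id": 1, "ts": 1},
    {"id": 1, "ts": 3},
    {"id": 2, "ts": 4},
]

# Sort on the grouping key, then on ts so that "first" and "last" are well defined.
sorted_rows = sorted(rows, key=itemgetter("id", "ts"))

first_kept = remove_duplicates(sorted_rows, "id", keep="first")
last_kept = remove_duplicates(sorted_rows, "id", keep="last")

print(first_kept)
print(last_kept)
```

With the secondary sort on `ts`, keeping the first record retains the earliest row per `id` and keeping the last retains the latest. That choice is the functional difference between the two designs, not the speed.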

ramsubbiah · Post by **ramsubbiah** » Thu May 24, 2012 5:10 am

Hi Ray,
Thanks for your response.
Since I don't have a premium membership, I am not able to see your complete response. I will upgrade my membership and let you know the outcome.

Thanks,
Ram