Understanding of Remove Duplicate Stage Execution
Posted: Wed May 23, 2012 8:19 am
Hi All,
I need a clarification regarding the Remove Duplicates stage.
Scenario 1:
Job Design:
Source Dataset ---> Sort stage ----> peek stage
As we all know, the Sort stage can also remove duplicates. In this case, when I checked the $APT_DUMP_SCORE output of my job, I could not see a separate operator (remdup) for the duplicate removal. So can I assume that the tsort operator performs both the sorting and the duplicate removal? Or is a remdup operator assigned internally to remove the duplicates?
Scenario 2:
Job Design:
Source Dataset ---> Sort stage ----> Remove Duplicate Stage ----> peek stage
In this case I could see that separate operators were assigned, one for the Sort stage and one for the Remove Duplicates stage.
Which approach is better in terms of performance? Thanks in advance.
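To make the two scenarios concrete, here is a minimal Python sketch of the two designs. It is only a conceptual model and not DataStage code: the function names (`sort_unique`, `sort_rows`, `remove_duplicates`) and the sample rows are hypothetical, and it assumes the first row per key is the one retained. It contrasts duplicate removal folded into the sort step with duplicate removal done as a separate pass over already-sorted input.

```python
from itertools import groupby
from operator import itemgetter


def sort_unique(rows, key):
    """Scenario 1 (conceptual): one step sorts and keeps the first row per key."""
    return [next(group) for _, group in groupby(sorted(rows, key=key), key=key)]


def sort_rows(rows, key):
    """Scenario 2, step 1 (conceptual): a plain stable sort on the key."""
    return sorted(rows, key=key)


def remove_duplicates(sorted_rows, key):
    """Scenario 2, step 2 (conceptual): a single pass over key-sorted input."""
    result = []
    for row in sorted_rows:
        # The input is sorted, so rows sharing a key are adjacent.
        if not result or key(result[-1]) != key(row):
            result.append(row)
    return result


# Hypothetical sample data: (key, payload)
rows = [(2, "b"), (1, "a"), (2, "c"), (1, "d")]
by_key = itemgetter(0)

scenario_1 = sort_unique(rows, by_key)
scenario_2 = remove_duplicates(sort_rows(rows, by_key), by_key)
print(scenario_1)
print(scenario_2)
```

Both designs return the same rows in this sketch; it illustrates the logical equivalence only and says nothing about which one runs faster in DataStage.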
Thanks,
Ram