Hi All,
I need a clarification regarding the Remove Duplicates stage.
Scenario 1:
Job Design:
Source Dataset ---> Sort stage ----> peek stage
As we all know, the Sort stage can be used to remove duplicates. When I checked the $APT_DUMP_SCORE output for this job, I could not see a separate operator (remdup operator). Can I assume the tsort operator is performing both the sort and the duplicate removal, or is a remdup operator assigned internally to remove the duplicates?
Scenario 2:
Job Design:
Source Dataset ---> Sort stage ----> Remove Duplicates stage ----> peek stage
In this case I can see that separate operators have been assigned, one for the Sort stage and one for the Remove Duplicates stage.
Which approach is better in terms of performance? Thanks in advance.
Thanks,
Ram
Understanding of Remove Duplicate Stage Execution
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 40
- Joined: Tue Nov 11, 2008 5:49 am
Knowledge is Fair,execution is matter!
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
They're pretty close to identical in terms of performance. What's different is the functionality - with the Remove Duplicates stage you can specify which record to keep from each group (first or last) whereas with a unique sort you cannot specify which record to keep from each group.
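To make the functional difference concrete, here is a rough Python sketch. It is not DataStage code and does not reflect how the tsort or remdup operators are implemented; the function names `sort_unique` and `remove_duplicates` and the sample rows are made up for illustration. The sketch also assumes the unique sort happens to keep the first row of each group after a stable sort, which is only a modelling choice here.

```python
from itertools import groupby
from operator import itemgetter


def sort_unique(rows, key):
    """Model of a unique sort: sort on the key and keep one row per key value.

    Assumption for this sketch: the retained row is the first one in each
    group after a stable sort. The caller has no option to change that.
    """
    ordered = sorted(rows, key=itemgetter(key))  # Python's sort is stable
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


def remove_duplicates(sorted_rows, key, keep="first"):
    """Model of a separate dedup step on already-sorted input.

    Unlike sort_unique, the caller chooses which row of each group survives.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    result = []
    for _, group in groupby(sorted_rows, key=itemgetter(key)):
        members = list(group)
        result.append(members[0] if keep == "first" else members[-1])
    return result


# Hypothetical sample data: two rows per id, in arrival order.
rows = [
    {"id": 2, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
    {"id": 1, "v": "d"},
]

ordered = sorted(rows, key=itemgetter("id"))
unique_rows = sort_unique(rows, "id")
first_rows = remove_duplicates(ordered, "id", keep="first")
last_rows = remove_duplicates(ordered, "id", keep="last")
```

With this sample data, `unique_rows` and `first_rows` contain the same rows, while `last_rows` keeps the other row of each group. That choice is the only thing the separate step adds in this model.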
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.