difference in Sort w/RD and Remove Duplicate w/Sort

swathi Singamareddygari
Participant
Posts: 48
Joined: Fri Feb 29, 2008 1:09 am
Location: Bangalore

Post by swathi Singamareddygari »

Hi all,

When the Sort stage already has a remove-duplicates option, why
do we also have a separate Remove Duplicates stage in PX, given
that it is recommended to sort the data before using the Remove
Duplicates stage anyway?

If anyone knows, please answer.
Thanks & Regards
S.Swathi
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Interview question?

PX has a dedicated stage for almost every transformation commonly needed during migration, so the existence of the Remove Duplicates stage is no surprise. If anything, the question is why the Sort stage exists, since you can use an in-link sort, which uses the same tsort operator as the Sort stage. Still, the explicit Sort stage offers added functionality and is easier to maintain.

A unique sort keeps the first record it encounters for each value of the sort key. Now let's say you want the first or last record per key for data sorted on date. In that case you have to sort the data on key + date and then remove duplicates on key alone. There are other ways to do the same, but in my opinion a combination of Sort and Remove Duplicates is the most suitable solution.
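
To make the key + date idea concrete, here is a rough Python sketch of the logic only. It is not DataStage code; the column names (cust, date, amt) and the remove_duplicates helper are made up for illustration. The data is sorted on key + date, then one record per key is retained, either the first or the last.

```python
from itertools import groupby
from operator import itemgetter

def remove_duplicates(rows, key, retain="first"):
    """Keep one row per key value. Assumes rows are already sorted so that
    equal keys are adjacent (hypothetical helper, not a DataStage API)."""
    kept = []
    for _, group in groupby(rows, key=itemgetter(key)):
        group = list(group)
        kept.append(group[0] if retain == "first" else group[-1])
    return kept

# Illustrative input: several records per customer, in arbitrary order.
rows = [
    {"cust": "A", "date": "2008-03-01", "amt": 10},
    {"cust": "B", "date": "2008-01-15", "amt": 7},
    {"cust": "A", "date": "2008-01-20", "amt": 5},
    {"cust": "B", "date": "2008-02-02", "amt": 9},
]

# Step 1: sort on key + date (ISO date strings sort chronologically).
sorted_rows = sorted(rows, key=itemgetter("cust", "date"))

# Step 2: remove duplicates on the key alone, retaining first or last.
earliest = remove_duplicates(sorted_rows, "cust", retain="first")
latest = remove_duplicates(sorted_rows, "cust", retain="last")
```

Here earliest holds the oldest record per customer and latest the newest. A unique sort on cust alone would give no such control over which record survives.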
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You can avoid the Remove Duplicates stage if you don't care which record from each group is kept; if you want to specify that the first or last record from each group is kept then you need a Remove Duplicates stage.

Remove Duplicates relies on data being sorted and partitioned on the key that identifies duplicates.
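
As a conceptual illustration of why the partitioning matters (plain Python, not DataStage; the helper names and the two-partition setup are assumptions for the sketch), compare de-duplicating per partition after a round-robin split with doing the same after a key-based hash split:

```python
from itertools import groupby
from zlib import crc32

def dedup_partition(rows):
    """Sort one partition on (key, value) and keep the first row per key."""
    rows = sorted(rows)
    return [next(group) for _, group in groupby(rows, key=lambda r: r[0])]

def round_robin(rows, n):
    """Deal rows out to n partitions without looking at the key."""
    return [rows[i::n] for i in range(n)]

def hash_partition(rows, n):
    """Send every row with the same key to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[crc32(row[0].encode()) % n].append(row)
    return parts

rows = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)]

def run(partitioner, n=2):
    """De-duplicate each partition independently, then collect the output."""
    out = []
    for part in partitioner(rows, n):
        out.extend(dedup_partition(part))
    return sorted(out)

bad = run(round_robin)      # keys A and B each survive in both partitions
good = run(hash_partition)  # exactly one row per key
```

With the round-robin split the same key lands in more than one partition, so duplicates survive in the combined output; with the key-based split each key is seen by exactly one partition.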

An explicit Sort stage gives you control over how much memory is allocated for sorting, and a number of other benefits that are not useful in the current scenario. Memory allocated for sorting can also be controlled by setting the APT_TSORT_STRESS_BLOCKSIZE environment variable.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.