Difference in Sort w/RD and Remove Duplicate w/Sort

swathi Singamareddygari · Wed Apr 07, 2010 5:15 am

Hi all,

When the Sort stage already has a remove-duplicates option, why
do we have a separate Remove Duplicates stage in PX, even though
it is recommended to sort data before using the Remove Duplicates
stage?

If anyone knows, please answer.

priyadarshikunal · Wed Apr 07, 2010 6:30 am

Interview question?

There is a specific stage in PX for almost every transformation commonly applied during migration, so the existence of the Remove Duplicates stage is not in question. If anything, the question is why the Sort stage exists, since you can use an inlink sort, which uses the same tsort operator as the Sort stage. Still, the explicit Sort stage has added functionality and is easier to maintain.

A unique sort keeps the first record it encounters for each value of the sort key. Now let's say you want the first or last record per key, with the data ordered by date; in that case you have to sort the data on key + date and then remove duplicates on key. There are other ways to do the same, but in my opinion a combination of Sort and Remove Duplicates is the most suitable solution.
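To make the key + date idea concrete, here is a rough sketch in plain Python. It is not DataStage code; it only emulates the logic of sorting on key + date and then keeping one record per key. The sample rows and the `remove_duplicates` helper are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (key, date, payload)
rows = [
    ("A", "2010-03-01", "a-old"),
    ("B", "2010-02-15", "b-only"),
    ("A", "2010-04-07", "a-new"),
    ("C", "2010-01-20", "c-new"),
    ("C", "2010-01-10", "c-old"),
]

def remove_duplicates(sorted_rows, retain="first"):
    """Keep one row per key; assumes rows are already sorted on the key."""
    kept = []
    for _, group in groupby(sorted_rows, key=itemgetter(0)):
        group = list(group)
        kept.append(group[0] if retain == "first" else group[-1])
    return kept

# Step 1: sort on key + date (ISO date strings sort chronologically).
sorted_rows = sorted(rows, key=itemgetter(0, 1))

# Step 2: remove duplicates on key only, retaining first or last.
earliest = remove_duplicates(sorted_rows, retain="first")
latest = remove_duplicates(sorted_rows, retain="last")

print(earliest)
print(latest)
```

Because the secondary sort column decides the order inside each key group, "first" gives the earliest record per key and "last" gives the latest one.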

ray.wurlod · Wed Apr 07, 2010 2:34 pm

You can avoid the Remove Duplicates stage if you don't care which record from each group is kept; if you want to specify that the first or last record from each group is kept then you need a Remove Duplicates stage.

Remove Duplicates relies on data being sorted and partitioned on the key that identifies duplicates.
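A small plain-Python sketch of why both requirements matter, under the assumption that duplicates are detected by comparing each row with the previous row inside a partition. The helper names (`dedupe_adjacent`, `round_robin`, `hash_partition`) and the sample rows are hypothetical and are not part of any DataStage API.

```python
# Hypothetical rows: (key, label)
rows = [(3, "x"), (1, "y"), (3, "z"), (2, "p"), (1, "q")]

def key_of(row):
    return row[0]

def dedupe_adjacent(part):
    """Drop a row when its key equals the key of the previous row."""
    out = []
    prev = object()  # sentinel that never equals a real key
    for row in part:
        k = key_of(row)
        if k != prev:
            out.append(row)
            prev = k
    return out

def round_robin(data, n):
    """Spread rows across n partitions by position, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(data):
        parts[i % n].append(row)
    return parts

def hash_partition(data, n):
    """Send every row with the same key to the same partition."""
    parts = [[] for _ in range(n)]
    for row in data:
        parts[hash(key_of(row)) % n].append(row)
    return parts

def sort_and_dedupe(parts):
    """Sort each partition on the key, dedupe it, and collect the keys."""
    keys = []
    for part in parts:
        keys.extend(key_of(r) for r in dedupe_adjacent(sorted(part, key=key_of)))
    return keys

# Unsorted input: no duplicate is adjacent, so nothing is removed.
unsorted_result = dedupe_adjacent(rows)

# Sorted but partitioned without regard to the key: key 1 survives twice.
round_robin_keys = sort_and_dedupe(round_robin(rows, 2))

# Sorted and partitioned on the key: each key survives exactly once.
hash_keys = sort_and_dedupe(hash_partition(rows, 2))

print(len(unsorted_result), round_robin_keys, hash_keys)
```

In this sketch, only the combination of key-based partitioning and sorting within each partition leaves one row per key.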

An explicit Sort stage gives you control over how much memory is allocated for sorting, and a number of other benefits that are not useful in the current scenario. Memory allocated for sorting can also be controlled by setting the APT_TSORT_STRESS_BLOCKSIZE environment variable.