Parallel Sort and RemoveDuplicates

ag_ram · Post by **ag_ram** » Sat May 19, 2007 4:45 am

Hello Folks

The following are the stages in a Job

DataSet--------->Sort ---------->RemoveDuplicates------------>DataSet

There are 6 key columns and 1 value column. I need the least value of the value column for each unique combination of the 6 keys. So the input to the Sort is hash partitioned on the 6 keys and the 1 value column, and the input to Remove Duplicates is repartitioned on the 6 keys that I want.

The volume of records in the input to the Sort can be huge (up to 6 million). Given this, is repartitioning advisable? Is there an alternative?

JoshGeorge · Post by **JoshGeorge** » Sat May 19, 2007 5:53 am

Along with the keys, sort the value column ascending and remove duplicates in the Sort stage itself. Whatever you were trying to achieve with the repartitioning and the Remove Duplicates stage can be done in the Sort stage itself. Why not try it that way?
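To make the logic concrete, here is a minimal sketch in plain Python (not DataStage code) of what "sort on the keys plus the value ascending, then keep the first row per key" does. The function name `min_value_per_key` and the sample rows are made up for illustration, and the sketch assumes the last column is the value column.

```python
from itertools import groupby

def min_value_per_key(rows, n_keys=6):
    """Sort on the key columns plus the value column (ascending), then keep
    the first row of each key group, i.e. the row with the least value."""
    ordered = sorted(rows, key=lambda row: (row[:n_keys], row[n_keys]))
    # After sorting, rows with equal keys are adjacent, so groupby sees
    # each key exactly once and next() picks its first (smallest) row.
    return [next(group) for _, group in groupby(ordered, key=lambda row: row[:n_keys])]

# Hypothetical sample data: 6 key columns followed by 1 value column.
rows = [
    ("a", "b", "c", "d", "e", "f", 30),
    ("a", "b", "c", "d", "e", "f", 10),
    ("a", "b", "c", "d", "e", "x", 7),
    ("a", "b", "c", "d", "e", "f", 20),
]
result = min_value_per_key(rows)
for row in result:
    print(row)
```

Because the value column is part of the sort order, keeping the first row of each key group is enough; no second pass over the data is needed.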

ray.wurlod · Post by **ray.wurlod** » Sat May 19, 2007 3:40 pm

Repartitioning is totally irrelevant to data volume. Who suggested it?

Repartitioning is added cost that should be avoided unless absolutely necessary, for example to achieve key or group adjacency or to match database (DB2) partitioning. Repartitioning is particularly costly on cluster or grid configurations, since the records need to be transferred across the network (rather than through shared memory) to their new partitions.
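As a rough illustration of key adjacency, the sketch below (plain Python, not DataStage code) hash partitions rows on the 6 keys only. Under that assumption every row for a given key combination lands in the same partition, so each partition can be reduced on its own with no second partitioning step. The helper names, the partition count of 4, and the sample rows are all hypothetical.

```python
def hash_partition(rows, n_partitions, n_keys=6):
    """Assign each row to a partition using a hash of the key columns only."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[:n_keys]) % n_partitions].append(row)
    return partitions

def least_per_key(partition, n_keys=6):
    """Within one partition, keep the row with the least value for each key."""
    best = {}
    for row in partition:
        key = row[:n_keys]
        if key not in best or row[n_keys] < best[key][n_keys]:
            best[key] = row
    return list(best.values())

# Hypothetical sample data: 6 key columns followed by 1 value column.
rows = [
    ("a", "b", "c", "d", "e", "f", 30),
    ("a", "b", "c", "d", "e", "f", 10),
    ("a", "b", "c", "d", "e", "x", 7),
    ("p", "q", "r", "s", "t", "u", 5),
    ("p", "q", "r", "s", "t", "u", 2),
]
partitions = hash_partition(rows, n_partitions=4)
# Each partition is processed independently; results are simply concatenated.
deduped = [row for part in partitions for row in least_per_key(part)]
```

If the value column were included in the hash, rows sharing the same 6 keys but carrying different values could end up in different partitions, which is the situation that forces a repartition before duplicates can be removed.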

Luciana · Post by **Luciana** » Sun May 20, 2007 12:16 pm

Set the Allow Duplicates option in the Sort stage to False to remove the duplicates.

DataSet--------->Sort ---------->DataSet