
Parallel Sort and RemoveDuplicates

Posted: Sat May 19, 2007 4:45 am
by ag_ram
Hello Folks

The following are the stages in a Job

DataSet--------->Sort ---------->RemoveDuplicates------------>DataSet

There are 6 key columns and 1 value column. For each unique combination of the 6 keys I need the row with the least value in the value column. So the input to the Sort stage is hash-partitioned on the 6 keys plus the value column, and the Remove Duplicates stage repartitions on the 6 keys that I want.

The volume of records going into the Sort can be huge (up to 6 million). Given this, is repartitioning advisable? Is there an alternative?

Posted: Sat May 19, 2007 5:53 am
by JoshGeorge
Along with the keys, sort the value column ascending and remove the duplicates in the Sort stage itself. Whatever you were trying to achieve with the repartitioning and the Remove Duplicates stage can be done in the Sort stage. Why not try it that way?
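To illustrate the idea outside DataStage, here is a rough Python sketch: sort on the 6 keys plus the value ascending, then keep only the first row of each key group. The function name `min_value_per_key`, the tuple row layout (6 keys followed by the value), and the sample rows are all made up for illustration; this is not DataStage code.

```python
from itertools import groupby


def min_value_per_key(rows, key_count=6):
    """Return one row per distinct key, the one with the smallest value.

    Assumes each row is a tuple of `key_count` key fields followed by
    the value column (a hypothetical layout for this sketch).
    """
    def key_of(row):
        return tuple(row[:key_count])

    # Sort on the keys first, then on the value ascending, so the smallest
    # value is the first row inside each group of identical keys.
    ordered = sorted(rows, key=lambda row: (key_of(row), row[key_count]))

    # Keeping the first row of every key group drops the duplicates.
    return [next(group) for _, group in groupby(ordered, key=key_of)]


rows = [
    ("a", "b", "c", "d", "e", "f", 30),
    ("a", "b", "c", "d", "e", "f", 10),
    ("a", "b", "c", "d", "e", "x", 20),
]
result = min_value_per_key(rows)
```

The point is that a single sort already puts the wanted row first in each key group, so no second pass or second partitioning is needed to pick it.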

Posted: Sat May 19, 2007 3:40 pm
by ray.wurlod
Repartitioning is totally irrelevant to data volume. Who suggested it?

Repartitioning is added cost that should be avoided unless absolutely necessary, for example to achieve key or group adjacency or to match database (DB2) partitioning. Repartitioning is particularly costly on cluster or grid configurations, since the records need to be transferred across the network (rather than through shared memory) to their new partitions.
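As a rough illustration of key adjacency, the Python sketch below hash-partitions on the 6 keys only (not on the value column). Every row for a given key combination then lands in the same partition, so each partition can be sorted and de-duplicated on its own with no second partitioning step. The helper names, the row layout (6 keys followed by the value), and the use of Python's built-in `hash` are assumptions for illustration only; they say nothing about how the parallel engine is actually implemented.

```python
def partition_on_keys(rows, partitions, key_count=6):
    """Assign each row to a bucket using a hash of its key fields only."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        key = tuple(row[:key_count])
        buckets[hash(key) % partitions].append(row)
    return buckets


def least_per_key(bucket, key_count=6):
    """Within one partition, keep the smallest value seen for each key."""
    best = {}
    for row in bucket:
        key = tuple(row[:key_count])
        value = row[key_count]
        if key not in best or value < best[key]:
            best[key] = value
    return best


rows = [
    ("a", "b", "c", "d", "e", "f", 30),
    ("a", "b", "c", "d", "e", "f", 10),
    ("a", "b", "c", "d", "e", "x", 20),
    ("a", "b", "c", "d", "e", "x", 25),
]
buckets = partition_on_keys(rows, partitions=4)

# Merge the per-partition results; no key can appear in two partitions,
# so the merge never has to reconcile conflicting answers.
merged = {}
for bucket in buckets:
    merged.update(least_per_key(bucket))
```

If the hash also included the value column, rows sharing the same 6 keys could be scattered across partitions, and that is the situation in which a repartition would be needed before removing duplicates.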

Posted: Sun May 20, 2007 12:16 pm
by Luciana
Set the Allow Duplicates option in the Sort stage to False to remove the duplicates.

DataSet--------->Sort ---------->DataSet