Parallel Sort and RemoveDuplicates
Posted: Sat May 19, 2007 4:45 am
Hello Folks
The following are the stages in the job:
DataSet--------->Sort ---------->RemoveDuplicates------------>DataSet
There are 6 key columns and 1 value column. I need the least value of the value column for each unique combination of the 6 keys. So the input to the Sort is hash partitioned on the 6 keys and the 1 value column, and the RemoveDuplicates repartitions on the 6 keys that I want.
The volume of records going into the Sort can be huge (up to 6 million). Given this, is repartitioning advisable? Is there an alternative?
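To make the intended logic concrete, here is a rough sketch in plain Python (not DataStage code). The function names and the tuple layout (6 keys followed by the value) are made up for illustration. It assumes the hash is taken on the 6 keys only, the sort is on the keys plus the value, and the first row of each key group is kept.

```python
from collections import defaultdict


def key_of(row):
    """Return the 6 key columns; the value is assumed to be the last field."""
    return row[:-1]


def hash_partition(rows, n_partitions):
    """Assign each row to a partition by hashing the 6 keys only."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(key_of(row)) % n_partitions].append(row)
    return partitions


def least_value_per_key(rows, n_partitions=4):
    """Sort each partition by (keys, value) and keep the first row per key."""
    result = []
    for part in hash_partition(rows, n_partitions).values():
        part.sort(key=lambda r: (key_of(r), r[-1]))
        previous_key = None
        for row in part:
            if key_of(row) != previous_key:
                # First row of a key group carries the least value.
                result.append(row)
                previous_key = key_of(row)
    return result
```

In this sketch all rows with the same 6 keys land in the same partition, so the duplicate removal runs on the partitions as they are, with no second partitioning step. Whether the real job behaves the same way depends on how the stages are configured, which is an assumption here.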