Hello Folks
The following are the stages in a Job
DataSet ---> Sort ---> RemoveDuplicates ---> DataSet
There are 6 key columns and 1 value column. I need the least value of the value column for each unique combination of the 6 keys. So the input to the Sort is hash partitioned on the 6 keys and the value column, and the input to Remove Duplicates is repartitioned on just the 6 keys that I want.
The volume of records in the input to the Sort can be huge (up to 6 million). Given this, is repartitioning advisable? Is there an alternative?
Parallel Sort and RemoveDuplicates
-
- Participant
- Posts: 612
- Joined: Thu May 03, 2007 4:59 am
- Location: Melbourne
Along with the keys, sort the value column ascending and remove duplicates in the Sort stage itself. Whatever you were trying to achieve with the repartitioning and the Remove Duplicates stage can be done in the Sort stage itself. Why not try it that way?
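To make the logic concrete, here is a rough Python sketch of "sort on keys plus value ascending, then keep the first record per key". It is not DataStage code; the function name `min_value_per_key` and the record layout (6 key fields followed by 1 value field in a tuple) are assumptions made only for illustration.

```python
from itertools import groupby

def min_value_per_key(records, n_keys=6):
    """Keep the record with the least value for each unique key combination.

    Assumed layout (hypothetical): each record is a tuple whose first
    n_keys fields are the keys and whose next field is the value.
    """
    def key_of(rec):
        return tuple(rec[:n_keys])

    # Sort on the keys first, then on the value ascending, so the
    # smallest value is the first record of each key group.
    ordered = sorted(records, key=lambda rec: (key_of(rec), rec[n_keys]))

    # Records with equal keys are now adjacent; keep the first of each group.
    return [next(group) for _, group in groupby(ordered, key=key_of)]
```

Because the value column is part of the sort order, "keep the first duplicate" and "keep the least value" are the same thing.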
Joshy George
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
Repartitioning is totally irrelevant to data volume. Who suggested it?
Repartitioning is added cost that should be avoided unless absolutely necessary, for example to achieve key or group adjacency or to match database (DB2) partitioning. Repartitioning is particularly costly on cluster or grid configurations, since the records need to be transferred across the network (rather than through shared memory) to their new partitions.
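As an illustration of key adjacency without a second partitioning step, the Python sketch below hash partitions once on the 6 keys only (not on the value column), then sorts and removes duplicates inside each partition. It is a model of the idea, not of the DataStage engine; the names `hash_partition`, `dedup_partition` and `run_job`, the use of `crc32` as the hash, and the tuple record layout are all assumptions.

```python
from zlib import crc32

def hash_partition(records, n_partitions, n_keys=6):
    """Assign each record to a partition using a hash of the keys only."""
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        keys = tuple(rec[:n_keys])
        # Same keys -> same hash -> same partition, whatever the value is.
        h = crc32(repr(keys).encode("utf-8"))
        parts[h % n_partitions].append(rec)
    return parts

def dedup_partition(part, n_keys=6):
    """Sort one partition on (keys, value) and keep the first record per key."""
    ordered = sorted(part, key=lambda rec: (tuple(rec[:n_keys]), rec[n_keys]))
    kept = []
    for rec in ordered:
        if not kept or tuple(kept[-1][:n_keys]) != tuple(rec[:n_keys]):
            kept.append(rec)
    return kept

def run_job(records, n_partitions=4, n_keys=6):
    """Partition once, then sort and deduplicate with no repartitioning."""
    result = []
    for part in hash_partition(records, n_partitions, n_keys):
        result.extend(dedup_partition(part, n_keys))
    return result
```

If the value column were included in the hash, two records with the same 6 keys could land in different partitions, and only then would a second partitioning on the keys be needed before removing duplicates.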
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.