Page 1 of 1

Partitioning Method in Sort Stage

Posted: Mon Aug 13, 2007 10:43 pm
by sid19
.
I am using Sort Stage the input has 5(A,B,C,D,E) columns and I need to sort on all of them in the order(B,A,D,E,C). and the rows coming in the sort stages are huge in number(around 200 millions of records).

So what will be the appropriate strategy for partition of Sort Stage so that we can get the result in least time?

Posted: Mon Aug 13, 2007 10:52 pm
by ArndW
Ideally you have your data hash partitioned on your 5 columns, so that each node at runtime only needs to sort data in its own stream and does not need to repartition.

Posted: Tue Aug 14, 2007 1:29 am
by JeroenDmt
I've been wondering about this: do you need to partition on all 5 columns? if you would partition on the first column in the sort, already each node only needs to sort data in its own stream without needing to repartition? So I would think for the sorting it wouldn't matter if you partition on all 5 columns or only the first column in the sort.
Then what is the advantage of partitioning on all 5 columns?

Posted: Tue Aug 14, 2007 5:22 am
by ArndW
No advantages to using all 5 at all, just performance disadvantages! You've hit the nail on the head, it is sufficient to partition on just one column.

Posted: Tue Aug 14, 2007 7:04 am
by balajisr
Also,It depends on the type of operation done after the sort stage. E.g If you need to remove duplicates with 5 columns being key sorting on only one column will deliver incorrect results.

Posted: Tue Aug 14, 2007 11:25 am
by Ed Purcell
balajisr wrote:Also,It depends on the type of operation done after the sort stage. E.g If you need to remove duplicates with 5 columns being key sorting on only one column will deliver incorrect results.
Also it's important whether the first column will give you an even split of the data when you use it to partition the data. You want approximately equal chunks of the data landing on each of your nodes to make the sorting easier.

Posted: Thu Aug 16, 2007 8:51 am
by sid19
Suppose first column has 5 distinct value so toatal 5 partition will come and we have only 2 node then how the data will distributed evenly on each node.

precisely I want to ask suppose our partition is not equal to number of node then how the data be distributed across the nodes

Posted: Thu Aug 16, 2007 3:30 pm
by ArndW
It will hash the 5 values to 2 nodes, one node will get 2/5 of the data, the other 3/5 of the data in your example. You can't get a better distribution unless you use round-robin, but then you would need to repartition again downstream for the sort so that approach is no good.