I am using a Sort Stage. The input has 5 columns (A, B, C, D, E) and I need to sort on all of them in the order (B, A, D, E, C). The number of rows coming into the Sort Stage is huge (around 200 million records).
What is the appropriate partitioning strategy for the Sort Stage so that we get the result in the least time?
Ideally you have your data hash partitioned on your 5 columns, so that at runtime each node only needs to sort the data in its own stream and does not need to repartition.
I've been wondering about this: do you really need to partition on all 5 columns? If you partition on just the first column of the sort (B), each node already only needs to sort the data in its own stream, with no repartitioning. So for the sorting itself it shouldn't matter whether you partition on all 5 columns or only on the first sort column.
Then what is the advantage of partitioning on all 5 columns?
There is no advantage to using all 5 at all, just performance disadvantages! You've hit the nail on the head: it is sufficient to partition on just one column.
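A small sketch of why one partitioning column is enough (hypothetical data and a stand-in hash function, not the engine's actual partitioner): if every row with the same B value lands on the same node, each node can sort its own stream on all 5 keys with no repartitioning.

```python
import hashlib

# Hypothetical 5-column rows; the sort key order is (B, A, D, E, C).
rows = [
    {"A": a, "B": b, "C": c, "D": d, "E": e}
    for a in (1, 2) for b in ("x", "y", "z") for c in (0, 1)
    for d in (10, 20) for e in ("p", "q")
]

NODES = 2

def node_for(row):
    # Hash-partition on the first sort key (B) only: every row with
    # the same B value is sent to the same node.
    return int(hashlib.md5(str(row["B"]).encode()).hexdigest(), 16) % NODES

partitions = {n: [] for n in range(NODES)}
for row in rows:
    partitions[node_for(row)].append(row)

# Each node sorts its own stream on all 5 keys independently.
sort_key = lambda r: (r["B"], r["A"], r["D"], r["E"], r["C"])
for n in partitions:
    partitions[n].sort(key=sort_key)

# No B value is split across nodes, so no repartitioning is needed
# before the per-node sort.
for part in partitions.values():
    assert part == sorted(part, key=sort_key)
```

Partitioning on more columns would not change this picture; it would only add hashing cost and (with unrelated downstream key sets) more chances of needing a repartition later.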
Also, it depends on the type of operation done after the Sort Stage. For example, if you need to remove duplicates with all 5 columns as the key, sorting on only one column will deliver incorrect results.
balajisr wrote: Also, it depends on the type of operation done after the Sort Stage. For example, if you need to remove duplicates with all 5 columns as the key, sorting on only one column will deliver incorrect results.
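Worth separating two things here: the *sort* for a 5-column duplicate key must use all 5 keys, but the *partitioning* can still be on the first column alone, because exact duplicates agree on every column, including the partitioning column, and so always hash to the same node. A tiny sketch (hypothetical rows, stand-in hash):

```python
import hashlib

NODES = 2
# Two exact duplicate rows (same values in all 5 columns) plus one other row.
rows = [("b1", 1, "d", "e", 9), ("b1", 1, "d", "e", 9), ("b2", 2, "d", "e", 9)]

def node_for(row):
    # Partition on the first sort key only.
    return int(hashlib.md5(str(row[0]).encode()).hexdigest(), 16) % NODES

parts = {n: [] for n in range(NODES)}
for r in rows:
    parts[node_for(r)].append(r)

# Duplicates can never be split across nodes, so deduplicating each
# node independently (after sorting on all 5 keys) is still correct.
deduped = [r for p in parts.values() for r in sorted(set(p))]
assert len(deduped) == 2
```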
It's also important whether the first column gives you an even split of the data when you use it to partition. You want approximately equal chunks of data landing on each of your nodes to make the sorting easier.
Suppose the first column has 5 distinct values, so in effect there are 5 hash groups, but we have only 2 nodes. How will the data be distributed evenly across the nodes?
More precisely: if the number of distinct partitioning values does not equal the number of nodes, how is the data distributed across the nodes?
It will hash the 5 values onto the 2 nodes; in your example one node gets 2/5 of the data and the other 3/5. You can't get a better distribution unless you use round-robin, but then you would need to repartition again downstream for the sort, so that approach is no good.
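A quick simulation of that granularity limit (hypothetical key values and row counts, stand-in hash): all rows for a given key value land on one node, so node totals can only ever be whole multiples of a key's share. With 5 equally sized keys on 2 nodes, the best achievable split is 2:3, and an unlucky hash can be more skewed (e.g. 1:4).

```python
import hashlib
from collections import Counter

NODES = 2
ROWS_PER_VALUE = 200  # hypothetical: each key value covers an equal share
values = ["v1", "v2", "v3", "v4", "v5"]

def node_for(value):
    # Stand-in for the engine's hash partitioner (illustrative only).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % NODES

per_node = Counter()
for v in values:
    per_node[node_for(v)] += ROWS_PER_VALUE

# A key's rows are never split, so each node's total is a multiple of
# one whole key's share: the partition count, not the node count,
# bounds how even the split can be.
assert sum(per_node.values()) == ROWS_PER_VALUE * len(values)
assert all(n % ROWS_PER_VALUE == 0 for n in per_node.values())
```

This is why a low-cardinality first column can be a poor partitioning choice even though it is logically sufficient for the sort.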