I am using a Sort stage. The input has 5 columns (A, B, C, D, E), and I need to sort on all of them in the order (B, A, D, E, C). The number of rows coming into the Sort stage is huge (around 200 million records).
What would be the appropriate partitioning strategy for the Sort stage so that we get the result in the least time?
Partitioning Method in Sort Stage
Ideally you have your data hash partitioned on your 5 columns, so that each node at runtime only needs to sort data in its own stream and does not need to repartition.
I've been wondering about this: do you need to partition on all 5 columns? If you partition on only the first column in the sort, each node already needs to sort only the data in its own stream, with no need to repartition. So I would think that, for the sorting, it doesn't matter whether you partition on all 5 columns or only on the first sort column.
Then what is the advantage of partitioning on all 5 columns?
No advantages to using all 5 at all, just performance disadvantages! You've hit the nail on the head: it is sufficient to partition on just one column.
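To make the point concrete, here is a rough Python model of the idea (not DataStage code). It hash-partitions rows either on B alone or on all 5 columns, sorts each node's rows in the order (B, A, D, E, C), and checks that rows sharing the same key land on one node. The helper names, the CRC32-based hash, the 4 nodes and the random test data are all assumptions made up for the illustration.

```python
import random
import zlib
from collections import defaultdict

SORT_ORDER = ("B", "A", "D", "E", "C")  # sort key order from the question


def node_for(row, key_cols, num_nodes):
    """Pick a node from a deterministic hash of the chosen key columns."""
    key = "|".join(str(row[c]) for c in key_cols)
    return zlib.crc32(key.encode()) % num_nodes


def partition_and_sort(rows, key_cols, num_nodes):
    """Hash-partition rows on key_cols, then sort each node's rows on its own."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row, key_cols, num_nodes)].append(row)
    for part in nodes.values():
        part.sort(key=lambda r: tuple(r[c] for c in SORT_ORDER))
    return nodes


def groups_stay_together(nodes, group_cols):
    """True if every distinct value of group_cols lives on exactly one node."""
    seen = {}
    for node_id, part in nodes.items():
        for row in part:
            group = tuple(row[c] for c in group_cols)
            if seen.setdefault(group, node_id) != node_id:
                return False
    return True


# Hypothetical test data: 5000 rows, each column a small integer.
random.seed(0)
rows = [{c: random.randint(1, 20) for c in "ABCDE"} for _ in range(5000)]

by_first_key = partition_and_sort(rows, ("B",), num_nodes=4)
by_all_keys = partition_and_sort(rows, SORT_ORDER, num_nodes=4)

# Partitioning on B alone still keeps identical 5-column keys on one node,
# because rows with the same full key necessarily share the same B.
print(groups_stay_together(by_first_key, ("B",)))
print(groups_stay_together(by_first_key, SORT_ORDER))
print(groups_stay_together(by_all_keys, SORT_ORDER))
```

In this model, hashing on B alone means each node hashes one value per row instead of five, and rows with equal 5-column keys are still co-located. Note that either way the result is sorted within each node only, not across nodes.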
Also it's important whether the first column will give you an even split of the data when you use it to partition the data. You want approximately equal chunks of the data landing on each of your nodes to make the sorting easier.
balajisr wrote: Also, it depends on the type of operation done after the Sort stage. E.g., if you need to remove duplicates with the 5 columns as the key, sorting on only one column will deliver incorrect results.
It will hash the 5 values to 2 nodes: in your example, one node will get 2/5 of the data and the other 3/5. You can't get a better distribution unless you use round-robin, but then you would need to repartition again downstream for the sort, so that approach is no good.
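The arithmetic can be checked with a small Python sketch. It assumes each of the 5 key values carries the same number of rows (an assumption, since the row counts are not given here) and enumerates every possible way of assigning 5 key values to 2 nodes. The 2/5 and 3/5 split turns out to be the best case; a real hash function may do worse.

```python
from fractions import Fraction
from itertools import product

NUM_KEYS = 5   # distinct values in the partitioning column
NUM_NODES = 2  # nodes in the configuration


def largest_share(assignment):
    """Fraction of the data on the busiest node, assuming equal rows per key value."""
    counts = [assignment.count(node) for node in range(NUM_NODES)]
    return Fraction(max(counts), NUM_KEYS)


# Every way a key-based partitioner could map the 5 key values to the 2 nodes.
assignments = list(product(range(NUM_NODES), repeat=NUM_KEYS))

best = min(largest_share(a) for a in assignments)
worst = max(largest_share(a) for a in assignments)

print(best)   # prints 3/5
print(worst)  # prints 1
```

Because all rows with the same key value must go to the same node, no key-based partitioner can put less than 3/5 of the data on the busiest node under these assumptions.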