Partitioning Method in Sort Stage

sid19
Participant
Posts: 64
Joined: Mon Jun 18, 2007 12:17 am
Location: kolkata

Post by sid19 »

I am using a Sort stage. The input has 5 columns (A, B, C, D, E), and I need to sort on all of them in the order B, A, D, E, C. The number of rows coming into the Sort stage is huge (around 200 million records).

So what would be the appropriate partitioning strategy for the Sort stage, so that we get the result in the least time?
Sid
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Ideally you have your data hash partitioned on your 5 columns, so that each node at runtime only needs to sort data in its own stream and does not need to repartition.
JeroenDmt
Premium Member
Posts: 107
Joined: Wed Oct 26, 2005 7:36 am

Post by JeroenDmt »

I've been wondering about this: do you need to partition on all 5 columns? If you partition on only the first column in the sort, each node already only needs to sort the data in its own stream, without needing to repartition. So I would think that, for the sorting, it doesn't matter whether you partition on all 5 columns or only on the first sort column.
Then what is the advantage of partitioning on all 5 columns?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

No advantages to using all 5 at all, just performance disadvantages! You've hit the nail on the head: it is sufficient to partition on just one column.
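
To illustrate why partitioning on the first sort column is enough, here is a minimal Python sketch (not DataStage code). It hash partitions rows on column B only and then sorts each partition on B, A, D, E, C. The CRC32 hash, the helper names and the sample rows are all assumptions made for the illustration; the hash function DataStage actually uses is not described in this thread.

```python
import zlib

def partition_of(key, nodes):
    # Stand-in hash (CRC32), chosen only because it is deterministic.
    # This is an assumption, not DataStage's real hash partitioner.
    return zlib.crc32(str(key).encode()) % nodes

def hash_partition(rows, key_index, nodes):
    # Send every row to the partition chosen by hashing one column.
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[partition_of(row[key_index], nodes)].append(row)
    return parts

# Columns are (A, B, C, D, E); the requested sort order is B, A, D, E, C.
SORT_ORDER = (1, 0, 3, 4, 2)

def sort_key(row):
    return tuple(row[i] for i in SORT_ORDER)

# Hypothetical sample rows.
rows = [
    ("a2", "b1", "c1", "d1", "e1"),
    ("a1", "b2", "c3", "d2", "e1"),
    ("a1", "b1", "c2", "d1", "e2"),
    ("a3", "b3", "c1", "d1", "e1"),
    ("a1", "b2", "c1", "d2", "e1"),
]

# Partition on B only (index 1), then sort each partition independently.
parts = hash_partition(rows, key_index=1, nodes=2)
sorted_parts = [sorted(p, key=sort_key) for p in parts]

for node, part in enumerate(sorted_parts):
    print(node, part)
```

Because all rows with the same B value land in the same partition, each partition can be sorted on all 5 columns without looking at any other partition.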
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Also, it depends on the type of operation done after the Sort stage. E.g., if you need to remove duplicates with all 5 columns as the key, sorting on only one column will deliver incorrect results.
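
A small Python sketch of the problem, assuming the downstream duplicate removal works by comparing each row with the row immediately before it (that is an assumption for the illustration, and the function name and sample rows are hypothetical):

```python
def remove_adjacent_duplicates(rows):
    """Keep a row only if it differs from the row just before it."""
    out = []
    for row in rows:
        if not out or out[-1] != row:
            out.append(row)
    return out

# Columns are (A, B, C, D, E); all three rows share the same B value.
rows = [
    ("a1", "b1", "c1", "d1", "e1"),
    ("a2", "b1", "c1", "d1", "e1"),
    ("a1", "b1", "c1", "d1", "e1"),  # duplicate of the first row
]

# Sorted on B only: the two identical rows are not adjacent.
by_b_only = sorted(rows, key=lambda r: r[1])

# Sorted on all 5 columns (B, A, D, E, C): identical rows become adjacent.
by_all = sorted(rows, key=lambda r: (r[1], r[0], r[3], r[4], r[2]))

print(len(remove_adjacent_duplicates(by_b_only)))  # 3, duplicate survives
print(len(remove_adjacent_duplicates(by_all)))     # 2, duplicate removed
```

Note that this concerns the sort keys only; the rows here all sit in one partition, as they would after partitioning on B.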
Ed Purcell
Premium Member
Posts: 23
Joined: Fri Mar 28, 2003 5:41 pm
Location: USA

Post by Ed Purcell »

balajisr wrote: Also, it depends on the type of operation done after the Sort stage. E.g., if you need to remove duplicates with all 5 columns as the key, sorting on only one column will deliver incorrect results.
It's also important whether the first column gives you an even split of the data when you use it to partition. You want approximately equal chunks of the data landing on each of your nodes, so that no single node ends up with most of the sorting work.
EPCCTX
sid19
Participant
Posts: 64
Joined: Mon Jun 18, 2007 12:17 am
Location: kolkata

Post by sid19 »

Suppose the first column has 5 distinct values, so there will be 5 partitions in total, and we have only 2 nodes. How will the data then be distributed evenly across the nodes?

More precisely, I want to ask: if the number of partitions is not equal to the number of nodes, how will the data be distributed across the nodes?
Sid
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

It will hash the 5 values to 2 nodes. In your example, if each value has about the same number of rows, then at best one node will get 2/5 of the data and the other 3/5. You can't get a better distribution unless you use round-robin, but then you would need to repartition again downstream for the sort, so that approach is no good.
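
As a rough sketch of that arithmetic in Python: each distinct value is hashed to one of the 2 nodes, and every row with that value follows it. CRC32 stands in for the real hash function, and the value names and row counts are made up, so the exact split this prints says nothing about what DataStage itself would do.

```python
import zlib

def node_for(value, nodes):
    # Stand-in hash (CRC32); an assumption, not DataStage's partitioner.
    return zlib.crc32(value.encode()) % nodes

def rows_per_node(value_counts, nodes):
    # Add up the row counts of all values that hash to each node.
    totals = {n: 0 for n in range(nodes)}
    for value, count in value_counts.items():
        totals[node_for(value, nodes)] += count
    return totals

# Hypothetical: 5 distinct values of the first sort column, 40 rows each.
value_counts = {"v1": 40, "v2": 40, "v3": 40, "v4": 40, "v5": 40}

print(rows_per_node(value_counts, nodes=2))
```

Since a value is never split across nodes, every node total is a multiple of 40 here, and 80/120 (2/5 and 3/5) is the most even split possible. A particular hash may well do worse, for example 40/160.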