Need for Sorting keys and Partitioning keys to be same

Nagin
Charter Member
Posts: 89
Joined: Thu Jan 26, 2006 12:37 pm

Need for Sorting keys and Partitioning keys to be same

Post by Nagin »

I think this is more of a parallel concepts question.

Why would the sort keys and the partition keys need to be the same? Do they need to be the same at all?

I have 10 columns in a data file and I want to sort the data on just 5 of them. In the Sort stage, can I partition on just one column, or do I have to use all 5 sort keys as partition keys too?

Thanks in advance for your help.
swades
Premium Member
Posts: 323
Joined: Mon Dec 04, 2006 11:52 pm

Re: Need for Sorting keys and Partitioning keys to be same

Post by swades »

Nagin wrote: Do they need to be same at all?
Not necessarily.
Nagin wrote: In the sort stage, can I just use one column to partition? Or Do I have to use all the 5 sort keys as partition keys too?
That depends on your in-stream data.

What are you trying to achieve by sorting the data? Removing duplicates, joining data (Join stage), or something else?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A moment's analytical thought will show that it is sufficient (always) to partition solely on the first sort key.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jcthornton
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

To expand on Ray's point, you can partition on any subset of the sort keys, with two limitations: the partition keys must appear in the same order as the sort keys with no gaps, and they must start with the first sort key.

For example, if you are sorting on key1 - key5 in order, your partition key choices are:
key1
key1, key2
key1, key2, key3
key1, key2, key3, key4
key1, key2, key3, key4, key5
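
The reasoning behind the list above can be checked with a small simulation. This is only a sketch in plain Python, not DataStage code: `hash_partition`, `round_robin_partition`, and `groups_are_colocated` are hypothetical helpers, and `crc32` stands in for whatever hash function the engine actually uses. It shows that hash partitioning on a leading subset of the sort keys keeps every row with the same full sort key on one node, while a keyless (round robin) partitioning does not.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(rows, part_keys, num_nodes):
    """Assign each row to a node from a hash of its partition key values.

    crc32 is an assumed stand-in for the engine's real hash partitioner.
    """
    partitions = defaultdict(list)
    for row in rows:
        key_bytes = repr(tuple(row[k] for k in part_keys)).encode()
        partitions[crc32(key_bytes) % num_nodes].append(row)
    return partitions


def round_robin_partition(rows, num_nodes):
    """Deal rows out to nodes in turn, ignoring their key values."""
    partitions = defaultdict(list)
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)
    return partitions


def sort_partitions(partitions, sort_keys):
    """Sort each node's rows independently, as a parallel sort would."""
    return {
        node: sorted(node_rows, key=lambda r: tuple(r[k] for k in sort_keys))
        for node, node_rows in partitions.items()
    }


def groups_are_colocated(partitions, group_keys):
    """Return True if rows sharing the same group key never span two nodes."""
    seen = {}
    for node, node_rows in partitions.items():
        for row in node_rows:
            group = tuple(row[k] for k in group_keys)
            if seen.setdefault(group, node) != node:
                return False
    return True


# Made-up sample data: 48 rows, three sort keys with repeating values.
rows = [{"key1": i % 4, "key2": i % 3, "key3": i % 2, "val": i} for i in range(48)]
sort_keys = ["key1", "key2", "key3"]

# Partition on the first sort key only, then sort each partition on all keys.
by_prefix = sort_partitions(hash_partition(rows, ["key1"], 5), sort_keys)
prefix_ok = groups_are_colocated(by_prefix, sort_keys)

# Contrast: round robin spreads rows with identical sort keys across nodes.
by_round_robin = sort_partitions(round_robin_partition(rows, 5), sort_keys)
round_robin_ok = groups_are_colocated(by_round_robin, sort_keys)

print("partition on key1:", prefix_ok)
print("round robin:", round_robin_ok)
```

Rows that agree on all of the sort keys necessarily agree on key1, so they hash to the same node. That is why partitioning on the first sort key alone is enough for key-based stages downstream of the sort.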

If you want to partition on key4, key1 then you have to change your sort key order to:
key4, key1, [key2/key3/key5 - any order on these last 3 keys]

Now, this is not a 100% rule. It is possible to contrive a scenario where you could want to partition and sort on entirely different keys. However, that is such a rare exception that you should never encounter it.

On a side note, when you talk about just using one key to partition by, hopefully this is because you are looking at downstream stages to help guide your partitioning logic. If partitioning by a single field means you will not need to repartition later, that is often preferred.

The advantage of partitioning by more fields is that it can help avoid unbalanced row counts across nodes if the partition key(s) have fewer unique values than anticipated.
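
To illustrate the balance point, here is a rough sketch under the same assumptions as before: plain Python, made-up data, a hypothetical `node_counts` helper, and `crc32` standing in for the real hash partitioner. The partition key has only two distinct values, so at most two of the four nodes can receive any rows. Adding a second, higher-cardinality key gives the hash more distinct values to spread around.

```python
from collections import Counter
from zlib import crc32


def node_counts(rows, part_keys, num_nodes):
    """Count how many rows each node would receive under hash partitioning.

    crc32 is an assumed stand-in for the engine's real hash partitioner.
    """
    counts = Counter()
    for row in rows:
        key_bytes = repr(tuple(row[k] for k in part_keys)).encode()
        counts[crc32(key_bytes) % num_nodes] += 1
    return [counts[n] for n in range(num_nodes)]


# Made-up data: key1 has only 2 distinct values, key2 is unique per row.
rows = [{"key1": i % 2, "key2": i} for i in range(1000)]

single = node_counts(rows, ["key1"], 4)         # low-cardinality partition key
wider = node_counts(rows, ["key1", "key2"], 4)  # more partition keys

print("partition on key1:      ", single)
print("partition on key1, key2:", wider)
```

The trade-off is the one already described: the wider partitioning may balance better, but a downstream stage keyed only on key1 would then need a repartition.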
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller