Need for Sorting keys and Partitioning keys to be same

Posted: Tue Nov 10, 2009 1:14 pm
by Nagin
I think this is more of a parallel concepts question.

Why would the sort keys and partition keys need to be the same? Do they need to be the same at all?

I have 10 columns in a data file and I want to sort the data on just 5 of them. In the Sort stage, can I partition on just one column, or do I have to use all 5 sort keys as partition keys too?

Thanks in advance for your help.

Re: Need for Sorting keys and Partitioning keys to be same

Posted: Tue Nov 10, 2009 1:35 pm
by swades
Nagin wrote: Do they need to be same at all?
Not necessarily.
Nagin wrote: In the sort stage, can I just use one column to partition? Or Do I have to use all the 5 sort keys as partition keys too?
That depends on your in-stream data.

Can you explain what you are trying to achieve by sorting the data? Removing duplicates, joining data (Join stage)?

Posted: Tue Nov 10, 2009 1:41 pm
by ray.wurlod
A moment's analytical thought will show that it is sufficient (always) to partition solely on the first sort key.

Posted: Tue Nov 10, 2009 3:39 pm
by jcthornton
To expand on Ray's point, you can partition on any subset of the sort keys, with two limitations: the partition keys must appear in the same order as the sort keys with no gaps, and they must start with the first sort key.

For example, if you are sorting on key1 - key5 in order, your partition key choices are:
key1
key1, key2
key1, key2, key3
key1, key2, key3, key4
key1, key2, key3, key4, key5

If you want to partition on key4, key1, then you have to change your sort key order to:
key4, key1, [key2/key3/key5 - any order on these last 3 keys]
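
The prefix rule above can be checked with a small simulation. This is only a sketch in Python, not DataStage code: the row layout, the crc32-based partitioner, and the helper names (key_of, hash_partition, partition_then_sort, groups_stay_together) are all assumptions made for illustration. It hash-partitions on each leading subset of the sort keys, sorts every partition independently, and checks that each key group ends up as one contiguous run in a single partition.

```python
import random
import zlib
from itertools import groupby


def key_of(row, keys):
    """Project a row (a dict) onto the given key columns."""
    return tuple(row[k] for k in keys)


def hash_partition(rows, part_keys, n_parts):
    """Send each row to a partition chosen by a deterministic hash of part_keys.

    crc32 is a stand-in here; it is not the hash DataStage uses.
    """
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        digest = zlib.crc32(repr(key_of(row, part_keys)).encode("utf-8"))
        parts[digest % n_parts].append(row)
    return parts


def partition_then_sort(rows, part_keys, sort_keys, n_parts):
    """Hash-partition, then sort every partition independently on sort_keys."""
    return [sorted(p, key=lambda r: key_of(r, sort_keys))
            for p in hash_partition(rows, part_keys, n_parts)]


def groups_stay_together(parts, group_keys):
    """True if each group_keys value forms one contiguous run in one partition."""
    seen = set()
    for part in parts:
        for value, _run in groupby(part, key=lambda r: key_of(r, group_keys)):
            if value in seen:  # split across partitions, or run was interrupted
                return False
            seen.add(value)
    return True


# Hypothetical sample data: three key columns, shuffled arrival order.
rows = [{"key1": a, "key2": b, "key3": c}
        for a in range(5) for b in range(3) for c in range(2)]
random.Random(0).shuffle(rows)

sort_keys = ["key1", "key2", "key3"]
results = {}
for n in range(1, len(sort_keys) + 1):
    part_keys = sort_keys[:n]  # key1 / key1,key2 / key1,key2,key3
    parts = partition_then_sort(rows, part_keys, sort_keys, n_parts=4)
    results[tuple(part_keys)] = groups_stay_together(parts, part_keys)
    print(part_keys, results[tuple(part_keys)])
```

Every leading subset should report True: rows that agree on the partition keys always hash to the same partition, and because those keys lead the sort order, the rows sit next to each other after the sort.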

Now, this is not a 100% rule. It is possible to contrive a scenario where you could want to partition and sort on entirely different keys. However, that is such a rare exception that you should never encounter it.

On a side note, when you talk about just using one key to partition by, hopefully this is because you are looking at downstream stages to help guide your partitioning logic. If partitioning by a single field means you will not need to repartition later, that is often preferred.

The advantage of partitioning on more fields is that it can help avoid unbalanced row counts across nodes if the partition key(s) have fewer unique values than anticipated.
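
To make the balance point concrete, here is a rough Python sketch (again a model, not DataStage itself). The data, the 4-node count, the crc32-based partitioner, and the helper names are assumptions for illustration. key1 has only two distinct values, so partitioning on it alone can never use more than two of the four partitions, while adding the high-cardinality key2 gives the hash many more values to spread.

```python
import zlib
from collections import Counter


def partition_index(values, n_parts):
    """Deterministic stand-in for a hash partitioner (not DataStage's hash)."""
    digest = zlib.crc32(repr(values).encode("utf-8"))
    return digest % n_parts


def rows_per_partition(rows, part_keys, n_parts):
    """Count how many rows land in each partition for the given partition keys."""
    counts = Counter()
    for row in rows:
        values = tuple(row[k] for k in part_keys)
        counts[partition_index(values, n_parts)] += 1
    return [counts[i] for i in range(n_parts)]


# Hypothetical data: key1 has 2 distinct values, key2 has 1000.
rows = [{"key1": i % 2, "key2": i} for i in range(1000)]
N_PARTS = 4

one_key = rows_per_partition(rows, ["key1"], N_PARTS)
two_keys = rows_per_partition(rows, ["key1", "key2"], N_PARTS)

print("partition on key1:      ", one_key)
print("partition on key1, key2:", two_keys)
```

With key1 alone, at least two partitions stay empty and one of them holds at least half the rows. The trade-off from the earlier paragraph still applies: the wider key set may force a repartition later if a downstream stage groups on key1 only.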