I think this is more of a parallel concepts question.
What is the need for Sort keys and Partition keys to be same? Do they need to be same at all?
I have 10 columns in a data file and I want to sort the data based on just 5 columns, In the sort stage, can I just use one column to partition? Or Do I have to use all the 5 sort keys as partition keys too?
Thanks in advance for your help.
Need for Sorting keys and Partitioning keys to be same
Moderators: chulett, rschirm, roy
Re: Need for Sorting keys and Partitioning keys to be same
Not Necessary.Nagin wrote:Do they need to be same at all?
That is depends on your in-stream data.Nagin wrote:In the sort stage, can I just use one column to partition? Or Do I have to use all the 5 sort keys as partition keys too?
Explain here what you are trying to achieve by sorting data? removing duplicate data, joining data (join stage) ??
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Premium Member
- Posts: 79
- Joined: Thu Mar 22, 2007 4:58 pm
- Location: USA
To expand on Ray's point, you can partition on any subset of the sort keys with two limitation: the sort and partition keys appear in the same order with no gaps and you start with the first sort key.
For example, if you are sorting on key1 - key5 in order, your partition keys choices are:
key1
key1, key2
key1, key2, key3
key1, key2, key3, key4
key1, key2, key3, key4, key5
If you want to partition on key4, key1 then you have to change your sort key order to:
key4, key1, [key2/key3/key5 - any order on these last 3 keys]
Now, this is not a 100% rule. It is possible to contrive a scenario where you could want to partition and sort on entirely different keys. However, that is such a rare exception that you should never encounter it.
On a side note, when you talk about just using one key to partition by, hopefully this is because you are looking at downstream stages to help guide your partitioning logic. If partitioning by a single field means you will not need to repartition later, that is often preferred.
The advantage of partitioning by more fields is it can help avoid unbalanced numbers between nodes for processing if the partition key(s) have fewer unique values than anticipated.
For example, if you are sorting on key1 - key5 in order, your partition keys choices are:
key1
key1, key2
key1, key2, key3
key1, key2, key3, key4
key1, key2, key3, key4, key5
If you want to partition on key4, key1 then you have to change your sort key order to:
key4, key1, [key2/key3/key5 - any order on these last 3 keys]
Now, this is not a 100% rule. It is possible to contrive a scenario where you could want to partition and sort on entirely different keys. However, that is such a rare exception that you should never encounter it.
On a side note, when you talk about just using one key to partition by, hopefully this is because you are looking at downstream stages to help guide your partitioning logic. If partitioning by a single field means you will not need to repartition later, that is often preferred.
The advantage of partitioning by more fields is it can help avoid unbalanced numbers between nodes for processing if the partition key(s) have fewer unique values than anticipated.
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller