how to hash partition for joining?

zulfi123786 · Post by **zulfi123786** » Wed Jan 14, 2009 10:40 pm

Hi,

I have to make a join on 4 keys between 2 datasets. the keys are ACB,ACS,ACX,ACP. do i need to hash partition on all 4 keys or is it sufficient to make hash partition on 1 key.

hamzaqk · Post by **hamzaqk** » Wed Jan 14, 2009 10:43 pm

hash partitioning and sort ascending on all key columns with the sort stage for links going into the join stage would be a good idea

Mike · Post by **Mike** » Thu Jan 15, 2009 8:32 am

Since join is a key-based operation, you have to partition the data in a manner so that matching keys are processed by the same node.

The usual practice is to hash partition by the full join key... but any partitioning that gets like keys on the same node is sufficient.

You could partition by ACB... by ACB,ACS... by ACB,ACS,ACX... by ACB,ACS,ACX,ACP.

In general, adding more fields to the partition key will reduce data skew, so you would normally hash partition by all join keys. If you hash partition by all join keys and still find that you have significant data skew, you can always hash partition with more keys than your join key. For example: ACB,ACS,ACX,ACP,<additional keys>.

The join also requires sorted input, so the entire join key must be in sort order. No matter how many keys are specified for the hash partitioning, you will always need to have the data sorted by ACB,ACS,ACX,ACP.

Mike