Hi,
I have to make a join on 4 keys between 2 datasets. the keys are ACB,ACS,ACX,ACP. do i need to hash partition on all 4 keys or is it sufficient to make hash partition on 1 key.
how to hash partition for joining?
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
Since join is a key-based operation, you have to partition the data in a manner so that matching keys are processed by the same node.
The usual practice is to hash partition by the full join key... but any partitioning that gets like keys on the same node is sufficient.
You could partition by ACB... by ACB,ACS... by ACB,ACS,ACX... by ACB,ACS,ACX,ACP.
In general, adding more fields to the partition key will reduce data skew, so you would normally hash partition by all join keys. If you hash partition by all join keys and still find that you have significant data skew, you can always hash partition with more keys than your join key. For example: ACB,ACS,ACX,ACP,<additional keys>.
The join also requires sorted input, so the entire join key must be in sort order. No matter how many keys are specified for the hash partitioning, you will always need to have the data sorted by ACB,ACS,ACX,ACP.
Mike
The usual practice is to hash partition by the full join key... but any partitioning that gets like keys on the same node is sufficient.
You could partition by ACB... by ACB,ACS... by ACB,ACS,ACX... by ACB,ACS,ACX,ACP.
In general, adding more fields to the partition key will reduce data skew, so you would normally hash partition by all join keys. If you hash partition by all join keys and still find that you have significant data skew, you can always hash partition with more keys than your join key. For example: ACB,ACS,ACX,ACP,<additional keys>.
The join also requires sorted input, so the entire join key must be in sort order. No matter how many keys are specified for the hash partitioning, you will always need to have the data sorted by ACB,ACS,ACX,ACP.
Mike