how to hash partition for joining?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

how to hash partition for joining?

Post by zulfi123786 »

Hi,

I have to make a join on 4 keys between 2 datasets. the keys are ACB,ACS,ACX,ACP. do i need to hash partition on all 4 keys or is it sufficient to make hash partition on 1 key.
hamzaqk
Participant
Posts: 249
Joined: Tue Apr 17, 2007 5:50 am
Location: islamabad

Post by hamzaqk »

hash partitioning and sort ascending on all key columns with the sort stage for links going into the join stage would be a good idea
Teradata Certified Master V2R5
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Since join is a key-based operation, you have to partition the data in a manner so that matching keys are processed by the same node.

The usual practice is to hash partition by the full join key... but any partitioning that gets like keys on the same node is sufficient.

You could partition by ACB... by ACB,ACS... by ACB,ACS,ACX... by ACB,ACS,ACX,ACP.

In general, adding more fields to the partition key will reduce data skew, so you would normally hash partition by all join keys. If you hash partition by all join keys and still find that you have significant data skew, you can always hash partition with more keys than your join key. For example: ACB,ACS,ACX,ACP,<additional keys>.

The join also requires sorted input, so the entire join key must be in sort order. No matter how many keys are specified for the hash partitioning, you will always need to have the data sorted by ACB,ACS,ACX,ACP.

Mike
Post Reply