Join partitioning requirement

Madhavan VM · Post by **Madhavan VM** » Thu Nov 15, 2007 2:50 pm

Hi,

I have to join two datasets based on keys L1 and D1, where L1 is a one up number and D1 is a date.

I believe that for the join to work, the input partitioning on links should be hashed and sorted on the keys.

My input data is coming as sorted on key L1. Hence, is it possible for me to join on L1 and D1 but using Hash partitioning method on the key L1 alone

To make it clear: I will be doing hash partitioning on the keys L1 and do sorting on L1 and D1. Would my join be succesfull

If not could you explain why

My rationale for this question is: Since the hashing algoritham is used on the main key, L1, all the L1 records would have the same hashed key value and would eventually fall in the same node. Then its a matter of sorting on the keys, L1 and D1, to find the join.

I have used the join all through my jobs with the golden rule that the data should be hashed and sorted on the keys. But, when a colleague of mine prompted me as to why we can't do it on the main key and then sort to get the desired result; I was not able to give a proper explanation.

your response is appreciated.

ray.wurlod · Post by **ray.wurlod** » Thu Nov 15, 2007 5:11 pm

You need to partition on both key columns, so that for any value of the key [L1,D1] all corresponding values will be found on the same partition.

That said, if you partition one input on [L1,D1] and one on [L1] alone, then you would still achieve the principle of key adjacency, because there are no values of [L1] on any other partition for any particular set of [L1,D1] values.

So, yes (ceteris paribus) your join will probably work.