Hash Partioining - doubt

vij · Post by **vij** » Wed Jan 17, 2007 7:25 am

Hi all,

I have a job which process about 80 Million to 100 Million of records.
In a joiner stage, I have selected the hash partioning and used the key column, col1 (2000 different possible values for this column) and doing a sort and partition based on that key column value.

I have 2 doubts:

1. I have another one column, col2 which has about 60 possible values.Which column I should use for partitioning, col1 or col2?

2.Also, I have not sorted the incoming data, I was said that if I sort the data using the sorter stage, before this partition, performance will be better. Is it correct?

Thanks in advance!

kris007 · Post by **kris007** » Wed Jan 17, 2007 8:29 am

Use a sort stage before a join stage and use hash partitioning in the sort stage. Then use the "same" partitioning in the join stage which follows. It is always better to use a sort stage for such heavy data rather than sorting within the join stage.

Answer to your first question, you have to sort and partition your data based upon the keys(columns) you would be joining.

ray.wurlod · Post by **ray.wurlod** » Wed Jan 17, 2007 3:56 pm

Hash partition and sort identically on all inputs on all the join keys. Partitioning so that all instances of any one value occur on the same node, sorting so that changes in value are quickly detected and memory can be freed.

kumar_s · Post by **kumar_s** » Wed Jan 17, 2007 5:31 pm

The data sets input to the Join stage must be key partitioned and sorted. You might have noticed this in document. So if you Joining key is Col1 and if you do a partition on Col2, it wont help you.