sort and hash partitioning

dnat · Post by **dnat** » Sun Feb 07, 2010 11:05 pm

Hi

I have a question on sorting before join.

Is it good to use explicit sort stage or doing a sort in the stage itself(by checking "Perform Sort") and do a hash partitioning?

If we are doing hash partitioning, then why do we need to sort the data, as the hash partitioning is done on keys?

If i just use Auto Partitioning, then is it fine rather than doing a sort and hash?

ray.wurlod · Post by **ray.wurlod** » Sun Feb 07, 2010 11:44 pm

Sorting is required because of the way that the Join stage works.

Even though the hash partitioning directs every row with value "X" to the same partition, there's no guarantee that they're adjacent rows in the data.

Auto partitioning on a Join stage will give you "Hash". This may not be optimal, particularly if the join key is a single integer, in which case Modulus will be more efficient a partitioning algorithm.

Marley777 · Post by **Marley777** » Thu Dec 30, 2010 2:30 pm

If I'm using a join stage to join two links having 10 million unsorted rows on each link, and set auto partitoning will auto partitioning take care of both sort and hash partitioning?

Here is how my env variable are set
APT_NO_SORT_INSERTION = FALSE
APT_NO_PART_INSERTION = FALSE

ray.wurlod · Post by **ray.wurlod** » Thu Dec 30, 2010 4:06 pm

No, you are almost certain to get incorrect results. And performance is likely to be woeful or you'll get "row out of order" fatal errors.

jwiles · Post by **jwiles** » Sun Jan 02, 2011 1:19 am

Marley777 wrote:If I'm using a join stage to join two links having 10 million unsorted rows on each link, and set auto partitoning will auto partitioning take care of both sort and hash partitioning?

Here is how my env variable are set
APT_NO_SORT_INSERTION = FALSE
APT_NO_PART_INSERTION = FALSE

Partitioning and Sorting are two separate operations. In order to work correctly, Join (and Merge) require that the data be sorted on the join/merge key columns and should be partitioned on at least the major (first) join/merge key column. The partitioning is to ensure that all rows with the same key value(s) are located in the same partitions for all input links.

With those environment variable settings you have disabled automatic sort and partition insertion at runtime. As a result, the Auto partition option is effectively disabled (no repartitioning will be performed where you have Auto selected). You will need to specify which partitioning type and keys, as well as sort keys and options.

I personally prefer using the sort stage over input link sort options as the stage allows the use of a few more options and is more visible in the job design.