sort and hash partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
dnat
Participant
Posts: 200
Joined: Thu Sep 06, 2007 2:06 am

sort and hash partitioning

Post by dnat »

Hi

I have a question on sorting before join.

Is it good to use explicit sort stage or doing a sort in the stage itself(by checking "Perform Sort") and do a hash partitioning?

If we are doing hash partitioning, then why do we need to sort the data, as the hash partitioning is done on keys?

If i just use Auto Partitioning, then is it fine rather than doing a sort and hash?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sorting is required because of the way that the Join stage works.

Even though the hash partitioning directs every row with value "X" to the same partition, there's no guarantee that they're adjacent rows in the data.

Auto partitioning on a Join stage will give you "Hash". This may not be optimal, particularly if the join key is a single integer, in which case Modulus will be more efficient a partitioning algorithm.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Marley777
Premium Member
Premium Member
Posts: 316
Joined: Tue Jan 27, 2004 3:26 pm

Post by Marley777 »

If I'm using a join stage to join two links having 10 million unsorted rows on each link, and set auto partitoning will auto partitioning take care of both sort and hash partitioning?

Here is how my env variable are set
APT_NO_SORT_INSERTION = FALSE
APT_NO_PART_INSERTION = FALSE
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No, you are almost certain to get incorrect results. And performance is likely to be woeful or you'll get "row out of order" fatal errors.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Marley777 wrote:If I'm using a join stage to join two links having 10 million unsorted rows on each link, and set auto partitoning will auto partitioning take care of both sort and hash partitioning?

Here is how my env variable are set
APT_NO_SORT_INSERTION = FALSE
APT_NO_PART_INSERTION = FALSE
Partitioning and Sorting are two separate operations. In order to work correctly, Join (and Merge) require that the data be sorted on the join/merge key columns and should be partitioned on at least the major (first) join/merge key column. The partitioning is to ensure that all rows with the same key value(s) are located in the same partitions for all input links.

With those environment variable settings you have disabled automatic sort and partition insertion at runtime. As a result, the Auto partition option is effectively disabled (no repartitioning will be performed where you have Auto selected). You will need to specify which partitioning type and keys, as well as sort keys and options.

I personally prefer using the sort stage over input link sort options as the stage allows the use of a few more options and is more visible in the job design.
Post Reply