Hi
I have a question on sorting before join.
Is it good to use explicit sort stage or doing a sort in the stage itself(by checking "Perform Sort") and do a hash partitioning?
If we are doing hash partitioning, then why do we need to sort the data, as the hash partitioning is done on keys?
If i just use Auto Partitioning, then is it fine rather than doing a sort and hash?
sort and hash partitioning
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Sorting is required because of the way that the Join stage works.
Even though the hash partitioning directs every row with value "X" to the same partition, there's no guarantee that they're adjacent rows in the data.
Auto partitioning on a Join stage will give you "Hash". This may not be optimal, particularly if the join key is a single integer, in which case Modulus will be more efficient a partitioning algorithm.
Even though the hash partitioning directs every row with value "X" to the same partition, there's no guarantee that they're adjacent rows in the data.
Auto partitioning on a Join stage will give you "Hash". This may not be optimal, particularly if the join key is a single integer, in which case Modulus will be more efficient a partitioning algorithm.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Partitioning and Sorting are two separate operations. In order to work correctly, Join (and Merge) require that the data be sorted on the join/merge key columns and should be partitioned on at least the major (first) join/merge key column. The partitioning is to ensure that all rows with the same key value(s) are located in the same partitions for all input links.Marley777 wrote:If I'm using a join stage to join two links having 10 million unsorted rows on each link, and set auto partitoning will auto partitioning take care of both sort and hash partitioning?
Here is how my env variable are set
APT_NO_SORT_INSERTION = FALSE
APT_NO_PART_INSERTION = FALSE
With those environment variable settings you have disabled automatic sort and partition insertion at runtime. As a result, the Auto partition option is effectively disabled (no repartitioning will be performed where you have Auto selected). You will need to specify which partitioning type and keys, as well as sort keys and options.
I personally prefer using the sort stage over input link sort options as the stage allows the use of a few more options and is more visible in the job design.