Hello,
I've a job in which there is a requirement to Hash partition on two different key sets in two different stages. Assume keys A & B for Stage2 and keys C & D for Stage3. Partial job design below:
Stage1 -> Stage2 -> Stage3 ...
If I Hash partition on keys A, B, C & D in Stage1 then would repartitioning be avoided in Stage2 and Stage3 even though I would have explicitly mentioned Hash partitioning on A & B in Stage2 and C & D in Stage3?
If repartitioning is not avoided then would replacing Hash partitioning in Stage2 & Stage3 by Same partitioning result in expected result?
I'm under impression that if initial stage of a job hash partitions on all keys which would be later hash partitioned, on few keys in different stages, then repartitioning doesn't occur. Please correct me if I'm wrong.
Thanks
Question regarding hash partitioning
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Re: Question regarding hash partitioning
Why?mavrick21 wrote:there is a requirement to Hash partition
You really aren't telling us much.
What do you understand to be the point of a key-based partitioning algorithm such as Hash or Modulus?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Re: Question regarding hash partitioning
Ray,ray.wurlod wrote: What do you understand to be the point of a key-based partitioning algorithm such as Hash or Modulus?
Correct me if I'm wrong. The point of a key-based partitioning algorithm is to get all records with same keys in the same partition.
Below is the job which I'm working on-
Code: Select all
DataSet3 DataSet4
| |
DataSet1------->Copy------------->Join1------------>Join2------>DataSet2
I'm thinking in terms of performance gain if my assumption is correct. My assumption is- If I sort and hash partition on keys A, B, C & D in Copy stage then I don't have to use Sort & Hash option for Join1 and Join2.
Is my assumption right?
Or is it that for a set of keys a hash is created and then partitioned based on hash value? If this is how DataStage works then there is no use of sorting and partitioning in Copy stage since it will again partition data in Join1 and Join2.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
What are the join keys? It's those on which the data have to be partitioned, or at the very least on the first of them*. However, the data have to be sorted on all join keys.
* Why is this?
If you partition on A alone, all combinations of A and B for any given value of A will be on the same partition.
The reason for partitioning as early as possible in a job (and using Same downstream of that) is to avoid unnecessary re-partitioning. By default (Auto) partitioning into your Copy stage uses Round Robin algorithm - the data would have to be re-partitioned to achieve Hash partitioning. But the Round Robin partitioning is actually unnecessary - it achieves nothing other than parallel execution, which Hash partitioning also achieves.
Round Robin is the algorithm of choice is no key-based partitioning is required in the job, because it spreads rows evenly among the available nodes.
* Why is this?
If you partition on A alone, all combinations of A and B for any given value of A will be on the same partition.
The reason for partitioning as early as possible in a job (and using Same downstream of that) is to avoid unnecessary re-partitioning. By default (Auto) partitioning into your Copy stage uses Round Robin algorithm - the data would have to be re-partitioned to achieve Hash partitioning. But the Round Robin partitioning is actually unnecessary - it achieves nothing other than parallel execution, which Hash partitioning also achieves.
Round Robin is the algorithm of choice is no key-based partitioning is required in the job, because it spreads rows evenly among the available nodes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.