Question regarding hash partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
mavrick21
Premium Member
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Question regarding hash partitioning

Post by mavrick21 »

Hello,

I've a job in which there is a requirement to Hash partition on two different key sets in two different stages. Assume keys A & B for Stage2 and keys C & D for Stage3. Partial job design below:

Stage1 -> Stage2 -> Stage3 ...

If I Hash partition on keys A, B, C & D in Stage1 then would repartitioning be avoided in Stage2 and Stage3 even though I would have explicitly mentioned Hash partitioning on A & B in Stage2 and C & D in Stage3?

If repartitioning is not avoided then would replacing Hash partitioning in Stage2 & Stage3 by Same partitioning result in expected result?

I'm under impression that if initial stage of a job hash partitions on all keys which would be later hash partitioned, on few keys in different stages, then repartitioning doesn't occur. Please correct me if I'm wrong.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Re: Question regarding hash partitioning

Post by ray.wurlod »

mavrick21 wrote:there is a requirement to Hash partition
Why?

You really aren't telling us much.

What do you understand to be the point of a key-based partitioning algorithm such as Hash or Modulus?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mavrick21
Premium Member
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Re: Question regarding hash partitioning

Post by mavrick21 »

ray.wurlod wrote: What do you understand to be the point of a key-based partitioning algorithm such as Hash or Modulus?
Ray,

Correct me if I'm wrong. The point of a key-based partitioning algorithm is to get all records with same keys in the same partition.

Below is the job which I'm working on-

Code: Select all

                               DataSet3             DataSet4
                                   |                   |

DataSet1------->Copy------------->Join1------------>Join2------>DataSet2

Currently here is what I'm doing: Used Sort and Hash partition on Keys A & B in Join1. Did the same in Join2 but for keys C & D. It works fine on a 4-node config file.

I'm thinking in terms of performance gain if my assumption is correct. My assumption is- If I sort and hash partition on keys A, B, C & D in Copy stage then I don't have to use Sort & Hash option for Join1 and Join2.
Is my assumption right?

Or is it that for a set of keys a hash is created and then partitioned based on hash value? If this is how DataStage works then there is no use of sorting and partitioning in Copy stage since it will again partition data in Join1 and Join2.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What are the join keys? It's those on which the data have to be partitioned, or at the very least on the first of them*. However, the data have to be sorted on all join keys.

* Why is this?
If you partition on A alone, all combinations of A and B for any given value of A will be on the same partition.

The reason for partitioning as early as possible in a job (and using Same downstream of that) is to avoid unnecessary re-partitioning. By default (Auto) partitioning into your Copy stage uses Round Robin algorithm - the data would have to be re-partitioned to achieve Hash partitioning. But the Round Robin partitioning is actually unnecessary - it achieves nothing other than parallel execution, which Hash partitioning also achieves.

Round Robin is the algorithm of choice is no key-based partitioning is required in the job, because it spreads rows evenly among the available nodes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mavrick21
Premium Member
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Thanks Ray!
Post Reply