Question regarding hash partitioning

mavrick21 · Post by **mavrick21** » Wed Jun 23, 2010 12:53 pm

Hello,

I've a job in which there is a requirement to Hash partition on two different key sets in two different stages. Assume keys A & B for Stage2 and keys C & D for Stage3. Partial job design below:

Stage1 -> Stage2 -> Stage3 ...

If I Hash partition on keys A, B, C & D in Stage1 then would repartitioning be avoided in Stage2 and Stage3 even though I would have explicitly mentioned Hash partitioning on A & B in Stage2 and C & D in Stage3?

If repartitioning is not avoided then would replacing Hash partitioning in Stage2 & Stage3 by Same partitioning result in expected result?

I'm under impression that if initial stage of a job hash partitions on all keys which would be later hash partitioned, on few keys in different stages, then repartitioning doesn't occur. Please correct me if I'm wrong.

Thanks

ray.wurlod · Post by **ray.wurlod** » Wed Jun 23, 2010 5:21 pm

mavrick21 wrote:there is a requirement to Hash partition

Why?

You really aren't telling us much.

What do you understand to be the point of a key-based partitioning algorithm such as Hash or Modulus?

mavrick21 · Post by **mavrick21** » Thu Jun 24, 2010 12:34 pm

ray.wurlod wrote: What do you understand to be the point of a key-based partitioning algorithm such as Hash or Modulus?

Ray,

Correct me if I'm wrong. The point of a key-based partitioning algorithm is to get all records with same keys in the same partition.

Below is the job which I'm working on-

Code: Select all

                               DataSet3             DataSet4
                                   |                   |

DataSet1------->Copy------------->Join1------------>Join2------>DataSet2

Currently here is what I'm doing: Used Sort and Hash partition on Keys A & B in Join1. Did the same in Join2 but for keys C & D. It works fine on a 4-node config file.

I'm thinking in terms of performance gain if my assumption is correct. My assumption is- If I sort and hash partition on keys A, B, C & D in Copy stage then I don't have to use Sort & Hash option for Join1 and Join2.
Is my assumption right?

Or is it that for a set of keys a hash is created and then partitioned based on hash value? If this is how DataStage works then there is no use of sorting and partitioning in Copy stage since it will again partition data in Join1 and Join2.

ray.wurlod · Post by **ray.wurlod** » Thu Jun 24, 2010 4:56 pm

What are the join keys? It's those on which the data have to be partitioned, or at the very least on the first of them*. However, the data have to be sorted on all join keys.

* Why is this?
If you partition on A alone, all combinations of A and B for any given value of A will be on the same partition.

The reason for partitioning as early as possible in a job (and using Same downstream of that) is to avoid unnecessary re-partitioning. By default (Auto) partitioning into your Copy stage uses Round Robin algorithm - the data would have to be re-partitioned to achieve Hash partitioning. But the Round Robin partitioning is actually unnecessary - it achieves nothing other than parallel execution, which Hash partitioning also achieves.

Round Robin is the algorithm of choice is no key-based partitioning is required in the job, because it spreads rows evenly among the available nodes.

mavrick21 · Post by **mavrick21** » Tue Jun 29, 2010 5:30 pm

Thanks Ray!

DSXchange

Question regarding hash partitioning

Question regarding hash partitioning

Re: Question regarding hash partitioning

Re: Question regarding hash partitioning