Confusion on Partitioning for Join stage

samyamkrishna · Post by **samyamkrishna** » Mon Dec 07, 2015 10:41 am

Hi,

If the data that you are joining is big, it's better to use hash partitioning with the join keys as the partitioning keys, especially if your jobs are running on multiple nodes. This will also improve performance.

sg33 · Post by **sg33** » Mon Dec 07, 2015 12:26 pm

Sorry, this may sound naive, but going by what you suggest: if I use Auto partitioning, shouldn't DS know which way of partitioning the data is the most efficient? How is defining hash partitioning explicitly going to help?

samyamkrishna · Post by **samyamkrishna** » Mon Dec 07, 2015 1:30 pm

If you specify the partitioning, then DS doesn't have to spend time and effort identifying which way is the most efficient. It can just do what you have asked it to do, thus saving time.

Hope this helps.

ray.wurlod · Post by **ray.wurlod** » Mon Dec 07, 2015 3:48 pm

For a Join stage, DataStage will always (under (Auto)) use hash partitioning on the Join keys, and will sort on the Join keys unless sorting is already specified on the input link or immediately upstream of it.

It may be more efficient to partition on the leading subset of the Join Keys, but this intelligence is not built into DataStage (since it's not always the case, but hash on all join keys will always provide the correct answer).
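As a rough illustration of that point (not DataStage code, just a Python sketch), the snippet below hash partitions two inputs and joins each partition independently, the way parallel nodes would. The helper names, the sample rows and the use of `zlib.crc32` as the hash are all assumptions made for the example. It shows that partitioning on all join keys and partitioning on a leading subset of them both bring matching rows together, so both return the same joined rows.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, keys, num_partitions):
    """Assign each row to a partition from a hash of the given key columns."""
    partitions = defaultdict(list)
    for row in rows:
        key_bytes = "|".join(str(row[k]) for k in keys).encode("utf-8")
        partitions[zlib.crc32(key_bytes) % num_partitions].append(row)
    return partitions


def partitioned_join(left, right, join_keys, partition_keys, num_partitions):
    """Inner join done partition by partition; rows never cross partitions."""
    left_parts = hash_partition(left, partition_keys, num_partitions)
    right_parts = hash_partition(right, partition_keys, num_partitions)
    result = []
    for p in range(num_partitions):
        for l_row in left_parts.get(p, []):
            for r_row in right_parts.get(p, []):
                if all(l_row[k] == r_row[k] for k in join_keys):
                    result.append({**l_row, **r_row})
    return result


# Hypothetical sample data: join on (cust, region).
left = [
    {"cust": 1, "region": "EU", "amount": 10},
    {"cust": 2, "region": "US", "amount": 20},
    {"cust": 3, "region": "EU", "amount": 30},
    {"cust": 1, "region": "US", "amount": 40},
]
right = [
    {"cust": 1, "region": "EU", "name": "Ann"},
    {"cust": 2, "region": "US", "name": "Bob"},
    {"cust": 1, "region": "US", "name": "Cy"},
    {"cust": 4, "region": "EU", "name": "Di"},
]
join_keys = ["cust", "region"]


def by_key(row):
    return (row["cust"], row["region"])


# Partition on all join keys (what the post says (Auto) does for a Join stage).
all_keys_result = sorted(
    partitioned_join(left, right, join_keys, ["cust", "region"], 4), key=by_key
)
# Partition on the leading subset only: rows that match on (cust, region)
# necessarily share cust, so they still land in the same partition.
subset_result = sorted(
    partitioned_join(left, right, join_keys, ["cust"], 4), key=by_key
)
```

Whether the subset version is actually faster depends on the data (for example, how evenly the leading key spreads rows across partitions), which fits the remark that it is not always the case.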

chulett · Post by **chulett** » Mon Dec 07, 2015 3:51 pm

There's nothing about Auto that 'spends time and effort' to determine the most efficient method to use for each stage and how it is being used in your job. Found a detailed discussion here that may help. Also thought this quote from it was worth putting here:

jwiles wrote: Auto partitioning will guarantee that the partitioning method chosen (if necessary) will meet the needs of the stage requiring said partitioning.

I also seem to recall a post where Ray specifically noted what Auto chooses for each stage type but couldn't turn it up.

ray.wurlod · Post by **ray.wurlod** » Mon Dec 07, 2015 4:00 pm

(Auto) uses hash partitioning on stages that designate Keys except as follows.

(Auto) uses DB2 partitioning for DB2 Connector stage.

(Auto) uses Entire partitioning for reference input to Lookup stage.

(Auto) uses Same partitioning for adjacent stages executing in parallel using the same node pool.

If there are no Keys needed (whether or not they are specified), (Auto) uses Round Robin partitioning.

(Auto) uses "eager Round Robin" collection (designated as (Auto)).
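For anyone who finds it easier to read as code, here is a small Python paraphrase of the partitioning rules listed above. It is only a restatement of this post, not DataStage's actual logic: the function name, the parameter names and the stage-type strings are invented for the example, and the order in which the exceptions are checked is an assumption, since the post does not say which one wins when several apply.

```python
def auto_partitioning(stage_type, needs_keys,
                      is_lookup_reference=False,
                      adjacent_parallel_same_pool=False):
    """Return the partitioning method (Auto) would pick, per the rules in the post.

    All names here are hypothetical; the precedence between the exceptions
    is an assumption made for this sketch.
    """
    if stage_type == "DB2 Connector":
        return "DB2"
    if stage_type == "Lookup" and is_lookup_reference:
        return "Entire"
    if adjacent_parallel_same_pool:
        return "Same"
    if not needs_keys:
        return "Round Robin"
    # Default for stages that designate keys.
    return "Hash"


join_choice = auto_partitioning("Join", needs_keys=True)
lookup_ref_choice = auto_partitioning("Lookup", needs_keys=True,
                                      is_lookup_reference=True)
no_key_choice = auto_partitioning("Transformer", needs_keys=False)
```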

samyamkrishna · Post by **samyamkrishna** » Mon Dec 07, 2015 4:21 pm

Hi Ray/chulett,

That's great to know.
But it doesn't answer sg33's question or my confusion.

Why do we need to partition if DS is intelligent enough to do it on its own?

stuartjvnorton · Post by **stuartjvnorton** » Mon Dec 07, 2015 5:20 pm

I would think it makes you think about what you're doing. If you just take the easy way out and Auto everything, you stop taking note of what you're doing and soon you find your job is repartitioning the data numerous times (and sometimes needlessly), and performance will suffer.

If you make yourself do the partitioning, you think about what you're doing and order your joins etc. to minimise the amount of work both you and the job have to do.
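To put a rough number on what a repartition costs, here is a Python sketch (not DataStage code) that counts how many rows would have to move to a different partition when the partitioning keys change between two steps. The function names, the sample columns and the `zlib.crc32` hash are assumptions for the illustration.

```python
import zlib


def partition_of(row, keys, num_partitions):
    """Partition number for a row, from a hash of the given key columns."""
    key_bytes = "|".join(str(row[k]) for k in keys).encode("utf-8")
    return zlib.crc32(key_bytes) % num_partitions


def rows_moved(rows, old_keys, new_keys, num_partitions):
    """Count rows whose partition changes when the partitioning keys change."""
    return sum(
        1
        for row in rows
        if partition_of(row, old_keys, num_partitions)
        != partition_of(row, new_keys, num_partitions)
    )


# Hypothetical data set with two candidate key columns.
rows = [{"order_id": i, "cust": i % 7, "product": i % 5} for i in range(1000)]

# Two consecutive steps keyed on the same column: nothing has to move.
moved_same_keys = rows_moved(rows, ["cust"], ["cust"], 4)

# A step keyed on cust followed by one keyed on product: rows get shuffled.
moved_new_keys = rows_moved(rows, ["cust"], ["product"], 4)

print(f"same keys: {moved_same_keys} rows moved")
print(f"different keys: {moved_new_keys} rows moved")
```

Ordering the joins so that consecutive stages share partitioning keys keeps the first case true for as much of the job as possible.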

ray.wurlod · Post by **ray.wurlod** » Tue Dec 08, 2015 12:11 am

(Auto) is guaranteed always to work (to deliver correct results, all else being equal).

(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).

samyamkrishna · Post by **samyamkrishna** » Tue Dec 08, 2015 12:54 pm

Thanks Ray and Stuart...

samyamkrishna · Post by **samyamkrishna** » Wed Dec 09, 2015 1:12 pm

Just another thought on this.

The job design is like this.

- - - - >Sort(Partitioned: Hash)- - - - >RemoveDuplicate(Auto)

In the above case, will the RemoveDuplicate stage do a sort again because it's set to Auto?

-------> Sort(Partitioned: Hash)----------> RemoveDuplicate(Partitioned:Same)

In this case, will the RemoveDuplicate stage not do a sort again?

ray.wurlod · Post by **ray.wurlod** » Wed Dec 09, 2015 1:50 pm

In neither case will the Remove Duplicates have any tsort operator included, because there is an explicit sort on the input link (in this case a Sort stage).

Partitioning and sorting are separate from each other.
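A small Python sketch of why the two are separate concerns (not DataStage code; the function name and sample rows are made up, and it assumes a deduplication step that only compares adjacent rows). Partitioning decides which partition a row lands in; it says nothing about the order of rows inside that partition, and an adjacent-row comparison only finds duplicates once the partition is sorted on the key.

```python
from itertools import groupby


def remove_adjacent_duplicates(rows, key):
    """Keep the first row of each run of consecutive rows sharing `key`."""
    return [next(group) for _, group in groupby(rows, key=lambda r: r[key])]


# Rows that have already been routed to the same partition, but not sorted.
partition = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 1, "v": "c"},
]

# Without a sort, the two id=1 rows are not adjacent, so both survive.
unsorted_result = remove_adjacent_duplicates(partition, "id")

# With a sort on the key first, the duplicates become adjacent and one is dropped.
sorted_result = remove_adjacent_duplicates(
    sorted(partition, key=lambda r: r["id"]), "id"
)
```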

samyamkrishna · Post by **samyamkrishna** » Wed Dec 09, 2015 2:22 pm

Sorry Ray, I shouldn't have said sort.
That just confused everyone.

So will the Remove Duplicates stage in the first case partition again on the keys because it's in Auto mode?
That's what my question was supposed to be.

stuartjvnorton · Post by **stuartjvnorton** » Wed Dec 09, 2015 5:50 pm

If it were me, I would take the 2 seconds and pick Same.
I should be telling DS what to do. That's what people pay me for.

ray.wurlod · Post by **ray.wurlod** » Wed Dec 09, 2015 6:13 pm

To reprise my earlier answer:

(Auto) is guaranteed always to work (to deliver correct results, all else being equal).

(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).

DSXchange
