Confusion on Partitioning for Join stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Re: Confusion on Partitioning for JOin stage

Post by samyamkrishna »

Hi,

If the data that you are joining is big, its better to use hash Prtition and use the join keys as the keys for partitioning specially if your jobs are sunning on multiple nodes. This will also improve performance.
Cheers,
Samyam
sg33
Participant
Posts: 25
Joined: Wed Nov 24, 2010 3:07 am
Location: India

Post by sg33 »

Sorry this may sound naive..but from what you suggest. If i use auto partitioning shouldn't DS know which way to partition the data is the most efficient one? How does defining hash partitioning explicitly going to help?
Best Regards
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

If you specify partitioning then DS doesnt have to spend time and effort to identify which is most efficient way. It can just does what you have asked it to do thus saving time.

hope this helps.
Cheers,
Samyam
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

For a Join stage DataStage will always (under (Auto)) use hash partitioning on the Join Keys, and sort on the Join keys unless there is sorting specified either on, or immediately upstream on, the input link.

It may be more efficient to partition on the leading subset of the Join Keys, but this intelligence is not built into DataStage (since it's not always the case, but hash on all join keys will always provide the correct answer).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's nothing about Auto that 'spends time and effort' to determine the most efficient method to use for each stage and how it is being used in your job. Found a detailed discussion here that may help. Also thought this quote from it was worth putting here:
jwiles wrote:Auto partitioning will guarantee that the partitioning method chosen (if necessary) will meet the needs of the stage requiring said partitioning.
I also seem to recall a post where Ray specifically noted what Auto chooses for each stage type but couldn't turn it up.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

(Auto) uses hash partitioning on stages that designate Keys except as follows.

(Auto) uses DB2 partitioning for DB2 Connector stage.

(Auto) uses Entire partitioning for reference input to Lookup stage.

(Auto) uses Same partitioning for adjacent stages executing in parallel using the same node pool.

If there are no Keys needed (whether or not they are specified), (Auto) uses Round Robin partitioning.

(Auto) uses "eager Round Robin" collection (designated as (Auto)).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Hi Ray/chulett,

Thats great to know.
Its dosent answer sg33's question or my confusion.

Why do we need to partition if DS is intelligent enough to do it on its own?
Cheers,
Samyam
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

I would think it makes you think about what you're doing. If you just take the easy way out and Auto everything, you stop taking note of what you're doing and soon you find your job is repartitioning the data numerous times (and sometimes needlessly), and performance will suffer.

If you make yourself do the partitioning, you think about what you're doing and order your joins etc to minimise the amount of work both you and the job has to do.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

(Auto) is guaranteed always to work (to deliver correct results, all else being equal).

(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Thanks Ray and Stuart...
Cheers,
Samyam
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Just another thought on this.

The job design is like this.

Code: Select all

- - - - >Sort(Partitioned: Hash)- - - - >RemoveDuplicate(Auto)

In the above case will the RemoveDuplicate stage does a sort again because its Auto?


Code: Select all

-------> Sort(Partitioned: Hash)----------> RemoveDuplicate(Partitioned:Same)


In this case will the RemoveDuplicate stage does not do a sort again?
Cheers,
Samyam
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In neither case will the Remove Duplicates have any tsort operator included, because there is an explicit sort on the input link (in this case a Sort stage).

Partitioning and sorting are separate from each other.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Sorry Ray. I shouldnt have said sort.
That just confused everyone.

So will the Remove Duplicate in the first case do a partition again on the keys because its in Auto mode.
Thats supposed to be my question.
Cheers,
Samyam
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

If it were me, I would take the 2 seconds and pick Same.
I should be telling DS what to do. That's what people pay me for.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

To reprise my earlier answer:

(Auto) is guaranteed always to work (to deliver correct results, all else being equal).

(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply