Confusion on Partitioning for Join stage
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
Re: Confusion on Partitioning for JOin stage
Hi,
If the data that you are joining is big, its better to use hash Prtition and use the join keys as the keys for partitioning specially if your jobs are sunning on multiple nodes. This will also improve performance.
If the data that you are joining is big, its better to use hash Prtition and use the join keys as the keys for partitioning specially if your jobs are sunning on multiple nodes. This will also improve performance.
Cheers,
Samyam
Samyam
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
For a Join stage DataStage will always (under (Auto)) use hash partitioning on the Join Keys, and sort on the Join keys unless there is sorting specified either on, or immediately upstream on, the input link.
It may be more efficient to partition on the leading subset of the Join Keys, but this intelligence is not built into DataStage (since it's not always the case, but hash on all join keys will always provide the correct answer).
It may be more efficient to partition on the leading subset of the Join Keys, but this intelligence is not built into DataStage (since it's not always the case, but hash on all join keys will always provide the correct answer).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
There's nothing about Auto that 'spends time and effort' to determine the most efficient method to use for each stage and how it is being used in your job. Found a detailed discussion here that may help. Also thought this quote from it was worth putting here:
I also seem to recall a post where Ray specifically noted what Auto chooses for each stage type but couldn't turn it up.jwiles wrote:Auto partitioning will guarantee that the partitioning method chosen (if necessary) will meet the needs of the stage requiring said partitioning.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
(Auto) uses hash partitioning on stages that designate Keys except as follows.
(Auto) uses DB2 partitioning for DB2 Connector stage.
(Auto) uses Entire partitioning for reference input to Lookup stage.
(Auto) uses Same partitioning for adjacent stages executing in parallel using the same node pool.
If there are no Keys needed (whether or not they are specified), (Auto) uses Round Robin partitioning.
(Auto) uses "eager Round Robin" collection (designated as (Auto)).
(Auto) uses DB2 partitioning for DB2 Connector stage.
(Auto) uses Entire partitioning for reference input to Lookup stage.
(Auto) uses Same partitioning for adjacent stages executing in parallel using the same node pool.
If there are no Keys needed (whether or not they are specified), (Auto) uses Round Robin partitioning.
(Auto) uses "eager Round Robin" collection (designated as (Auto)).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
I would think it makes you think about what you're doing. If you just take the easy way out and Auto everything, you stop taking note of what you're doing and soon you find your job is repartitioning the data numerous times (and sometimes needlessly), and performance will suffer.
If you make yourself do the partitioning, you think about what you're doing and order your joins etc to minimise the amount of work both you and the job has to do.
If you make yourself do the partitioning, you think about what you're doing and order your joins etc to minimise the amount of work both you and the job has to do.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
(Auto) is guaranteed always to work (to deliver correct results, all else being equal).
(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).
(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
Just another thought on this.
The job design is like this.
In the above case will the RemoveDuplicate stage does a sort again because its Auto?
In this case will the RemoveDuplicate stage does not do a sort again?
The job design is like this.
Code: Select all
- - - - >Sort(Partitioned: Hash)- - - - >RemoveDuplicate(Auto)
Code: Select all
-------> Sort(Partitioned: Hash)----------> RemoveDuplicate(Partitioned:Same)
In this case will the RemoveDuplicate stage does not do a sort again?
Cheers,
Samyam
Samyam
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
In neither case will the Remove Duplicates have any tsort operator included, because there is an explicit sort on the input link (in this case a Sort stage).
Partitioning and sorting are separate from each other.
Partitioning and sorting are separate from each other.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 258
- Joined: Tue Jul 04, 2006 10:35 pm
- Location: Toronto
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
To reprise my earlier answer:
(Auto) is guaranteed always to work (to deliver correct results, all else being equal).
(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).
(Auto) is guaranteed always to work (to deliver correct results, all else being equal).
(Auto) is not guaranteed to be optimally efficient in all cases. This is where someone with some knowledge can get a job to perform better (finish faster).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.