Performance of job vs number of join stages
Moderators: chulett, rschirm, roy
Hi,
What happens to the performance of a job as the number of Join stages (full outer join) increases?
Thanks.
RK Raju
mahadev.v wrote: Performance reduces.
I don't think so.
You are adding more Join stages only because there is more functionality to implement (I'll assume this). If you ask the job to do more work, it will consume more time; that is not the same as a drop in performance. If you want a performance improvement, try revamping the logic.
Thanks
DSDexter
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Premium Member
- Posts: 783
- Joined: Mon Jan 16, 2006 10:17 pm
- Location: Sydney, Australia
Hi All,
Thanks for your replies. Let me explain the issue more clearly. We have a job with six sequential files as input to a single Join stage. Initially we used a left outer join. The requirement now is to use a full outer join so that we also get the unmatched data. Because a full outer join supports only two inputs, we need five Join stages in this job. Does the time taken by the job increase as we increase the number of Join stages? If so, is there an alternative solution?
Thanks.
RK Raju
The total time should not be an issue using cascaded Join stages, because pipeline parallelism keeps the data flowing. Join stages are fairly good at throughput, because they rely on both left and right inputs being identically partitioned and sorted. For example, they need only one row at a time from the left input to perform inner and left outer joins.
Impossible to say time will be "reduced" or "increased" without something against which to compare. But a job with five Join stages should not take 25% more time than a job with four Join stages (the figure one would expect arithmetically). It should do better than that.
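To put rough numbers on that comparison, here is a minimal sketch of an idealized timing model. It assumes every stage spends one time unit per row and ignores sorting, I/O and resource contention; the function names are made up for illustration and nothing here is measured on a real job.

```python
def serial_time(rows: int, stages: int) -> int:
    """Time if each stage had to finish all rows before the next one starts."""
    return rows * stages


def pipelined_time(rows: int, stages: int) -> int:
    """Time if stages overlap: fill the pipeline once, then one row per time unit."""
    return rows + stages - 1


rows = 1_000_000  # hypothetical row count

# The "arithmetic" expectation: five stages cost 5/4 of four stages.
serial_ratio = serial_time(rows, 5) / serial_time(rows, 4)

# With pipelining the extra stage only adds one more unit of fill latency.
pipelined_ratio = pipelined_time(rows, 5) / pipelined_time(rows, 4)

print(f"serial ratio:    {serial_ratio:.2f}")  # 1.25
print(f"pipelined ratio: {pipelined_ratio:.6f}")
```

Under these assumptions the fifth stage adds almost nothing to the elapsed time. A real job will land somewhere between the two ratios, depending on how busy the processes and disks already are.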
Beware, however, that operator combination is extremely eager to combine as many operators as possible, which might prove counter-productive in your case. It might be a good plan to make the third Join stage of five non-combinable (on its stage Advanced properties tab).
Otherwise one process might become overwhelmed at trying to effect all five joins.
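The cascade described above can be sketched outside DataStage in plain Python. This is only an illustration of a streaming merge join, not how the Join stage is implemented: it assumes every input is already sorted on the same key, that keys are unique within each input, and the names full_outer_join, cascade and the "id" column are hypothetical.

```python
from functools import reduce
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def full_outer_join(left: Iterable[Row], right: Iterable[Row],
                    key: str = "id") -> Iterator[Row]:
    """Merge two key-sorted streams, holding only one row per side at a time.

    Assumes keys are unique within each input (an assumption of this sketch).
    """
    left_it, right_it = iter(left), iter(right)
    l, r = next(left_it, None), next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[key] < r[key]):
            yield dict(l)                # unmatched left row
            l = next(left_it, None)
        elif l is None or r[key] < l[key]:
            yield dict(r)                # unmatched right row
            r = next(right_it, None)
        else:
            yield {**l, **r}             # keys match: combine the columns
            l, r = next(left_it, None), next(right_it, None)


def cascade(inputs: List[Iterable[Row]], key: str = "id") -> Iterator[Row]:
    """Chain n inputs through n - 1 two-input joins; rows flow through lazily."""
    return reduce(lambda acc, nxt: full_outer_join(acc, nxt, key), inputs)


# Three small inputs stand in for the six files; each is sorted on "id".
files = [
    [{"id": 1, "a": "x"}, {"id": 3, "a": "y"}],
    [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}],
    [{"id": 1, "c": "m"}, {"id": 4, "c": "n"}],
]
result = list(cascade(files))
for row in result:
    print(row)
```

Because each join emits its rows in key order, its output can feed the next join without another sort, and no join has to wait for the previous one to finish. Columns from a side with no match are simply absent here, where a real full outer join would carry nulls or defaults.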
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sorting the data will take most of the time. So if the partitioning and join keys are the same for all files, the extra Join stages should not make a major difference to the total time.
Last edited by keshav0307 on Mon Jul 07, 2008 6:20 pm, edited 1 time in total.
keshav0307 wrote: sorting of the data will take most of the time. So if the partition and join key are same for all files then there should be major difference in total time.
I think you left out "not".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.