Performance of job vs number of join stages

verify
Premium Member
Posts: 99
Joined: Sun Mar 30, 2008 8:35 am

Post by verify »

Hi,

What happens to a job's performance as the number of Join stages (full outer join) increases?



Thanks.
RK Raju
mahadev.v
Participant
Posts: 111
Joined: Tue May 06, 2008 5:29 am
Location: Bangalore

Post by mahadev.v »

Performance reduces.
"given enough eyeballs, all bugs are shallow" - Eric S. Raymond
DSDexter
Participant
Posts: 94
Joined: Wed Jul 11, 2007 9:36 pm
Location: Pune,India

Post by DSDexter »

mahadev.v wrote: Performance reduces.
I don't think so... :shock:

You are adding more Join stages because they have some functionality to implement (I'll assume this). If you want the job to do more work, then the time consumed will be greater, but that is not the same as worse performance. If you want a performance improvement, try revamping the logic.
Thanks
DSDexter
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Without understanding your definition of "performance" in an ETL context your question is impossible to answer.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

You may see a performance hit, and you may also run short of sort space.
verify
Premium Member
Posts: 99
Joined: Sun Mar 30, 2008 8:35 am

Post by verify »

Hi All,

Thanks for your replies. Let me explain the issue more clearly. We have a job with six sequential files as input to a single Join stage. Initially we used a left outer join. The requirement now is to use a full outer join so that we also get the unmatched data. Because a full outer join supports only two inputs, we need five Join stages in this job. Does the time taken by the job increase as we increase the number of Join stages? If so, is there an alternative solution?


Thanks.
RK Raju
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The total time should not be an issue using cascaded Join stages, because pipeline parallelism keeps the data flowing. Join stages are fairly good at throughput, because they rely on the fact that both left and right inputs are identically partitioned and sorted. As a result they need, for example, only one row at a time from the left input to perform inner and left outer joins.
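To make the "one row at a time from the left input" point concrete, here is a minimal sketch in plain Python. It is not DataStage code and not how the Join stage is actually implemented. It assumes both inputs are already sorted on the join key, that rows are hypothetical `(key, value)` tuples, and that keys on the right input are unique.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (join key, payload)


def left_outer_merge_join(
    left: Iterable[Row], right: Iterable[Row]
) -> Iterator[Tuple[int, str, Optional[str]]]:
    """Left outer join of two inputs sorted on the key (right keys assumed unique)."""
    right_it = iter(right)
    r = next(right_it, None)
    for key, value in left:  # only one left row is held at any moment
        # Advance the right input past keys smaller than the current left key.
        while r is not None and r[0] < key:
            r = next(right_it, None)
        if r is not None and r[0] == key:
            yield (key, value, r[1])   # matched row
        else:
            yield (key, value, None)   # unmatched left row, right side is null


left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(0, "z"), (2, "x"), (3, "y")]
result = list(left_outer_merge_join(left_rows, right_rows))
```

Because neither input is ever re-read or buffered in full, the cost per row stays roughly constant, which is why sorted and identically partitioned inputs matter so much here.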

It is impossible to say whether the time will be "reduced" or "increased" without something against which to compare. But a job with five Join stages should not take 25% more time than a job with four Join stages (the figure one would expect arithmetically). It should do better than that.

Beware, however, that operator combination is extremely eager to combine as many operators as possible, which might prove counter-productive in your case. It might be a good plan to make the third Join stage of five non-combinable (on its stage Advanced properties tab).
Otherwise one process might become overwhelmed at trying to effect all five joins.
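The cascade described in this thread (six inputs, five two-input full outer joins) can be sketched with chained Python generators. This is only an illustration of rows flowing through every join without intermediate results being landed. It does not model DataStage partitioning, operator combination, or the Join stage's real output columns. The row shape, the function names `join_two` and `cascade`, and the assumption of unique, pre-sorted keys per input are all hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Values = Tuple[Optional[str], ...]


def join_two(
    left: Iterable[Tuple[int, Values]],
    left_width: int,
    right: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, Values]]:
    """Full outer join of two key-sorted inputs (keys assumed unique per input)."""
    left_it, right_it = iter(left), iter(right)
    l, r = next(left_it, None), next(right_it, None)
    pad = (None,) * left_width  # nulls for the left columns of unmatched right rows
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            yield (l[0], l[1] + (None,))       # left row with no match
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            yield (r[0], pad + (r[1],))        # right row with no match
            r = next(right_it, None)
        else:
            yield (l[0], l[1] + (r[1],))       # keys match on both sides
            l, r = next(left_it, None), next(right_it, None)


def cascade(inputs: List[List[Tuple[int, str]]]):
    """Chain N inputs through N-1 two-input joins; nothing runs until rows are pulled."""
    stream = ((k, (v,)) for k, v in inputs[0])
    stages = 0
    for width, nxt in enumerate(inputs[1:], start=1):
        stream = join_two(stream, width, nxt)
        stages += 1
    return stream, stages


files = [
    [(1, "a1"), (2, "a2")],
    [(2, "b2")],
    [(1, "c1"), (3, "c3")],
    [],
    [(3, "e3")],
    [(1, "f1")],
]
stream, stages = cascade(files)
rows = list(stream)
```

Each output row is pulled through all five joins in turn, so no join waits for the previous one to finish its whole input. In this sketch every join runs in a single Python process, which loosely resembles the fully combined case the post warns about.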
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

Sorting the data will take most of the time. So if the partitioning and join keys are the same for all files, adding Join stages should not make a major difference to the total time.
Last edited by keshav0307 on Mon Jul 07, 2008 6:20 pm, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

keshav0307 wrote:sorting of the data will take most of the time. So if the partition and join key are same for all files then there should be major difference in total time.
I think you left out "not".