Performance of job vs number of join stages

Posted: Thu Jul 03, 2008 5:37 am
by verify
Hi,

What happens to the performance of a job as the number of Join stages (full outer join) increases?



Thanks.

Posted: Thu Jul 03, 2008 5:46 am
by mahadev.v
Performance reduces.

Posted: Thu Jul 03, 2008 6:44 am
by DSDexter
mahadev.v wrote:Performance reduces.
I don't think so... :shock:

You are adding more Join stages only because they have some functionality to implement (I'll assume this). So if you want your job to do more work, then the time consumed will be greater; that is not the same as worse performance. If you want a performance improvement, try revamping the logic.

Posted: Thu Jul 03, 2008 4:06 pm
by ray.wurlod
Without understanding your definition of "performance" in an ETL context your question is impossible to answer.

Posted: Thu Jul 03, 2008 7:24 pm
by keshav0307
You may see a performance hit and may also run short of sort space.

Posted: Sun Jul 06, 2008 11:44 pm
by verify
Hi All,

Thanks for your replies. Let me explain the issue more clearly. We have a job with six sequential files as input and one Join stage. Initially we used a left outer join. The requirement now is to use a full outer join to get the unmatched data. Because the full outer join supports only two inputs, we need five Join stages for this job. Does the time taken by the job increase as we increase the number of Join stages? If so, is there an alternative solution?


Thanks.
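
To make the shape of the problem concrete, here is a minimal Python sketch of cascading two-input full outer joins over six inputs. It is only an illustration of the logic, not DataStage code: the function names, the sample data, and the assumption that each input has unique keys are all hypothetical.

```python
from functools import reduce


def full_outer_join(left, right):
    """Two-input full outer join.

    Each input is a dict mapping join key -> dict of columns.
    Assumption: keys are unique within each input.
    """
    joined = {}
    for key in left.keys() | right.keys():
        row = {}
        row.update(left.get(key, {}))   # columns from the left side, if matched
        row.update(right.get(key, {}))  # columns from the right side, if matched
        joined[key] = row
    return joined


def cascade_full_outer(inputs):
    """Chain two-input joins left to right; N inputs need N-1 joins."""
    return reduce(full_outer_join, inputs)


# Hypothetical data: six inputs, each contributing one column.
files = [
    {1: {"a": "a1"}, 2: {"a": "a2"}},
    {2: {"b": "b2"}, 3: {"b": "b3"}},
    {1: {"c": "c1"}},
    {4: {"d": "d4"}},
    {2: {"e": "e2"}},
    {3: {"f": "f3"}, 5: {"f": "f5"}},
]
result = cascade_full_outer(files)
```

With six inputs, `reduce` calls `full_outer_join` five times, which mirrors the five Join stages described above. Keys that appear in only one input (such as 4 and 5 here) survive to the final result.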

Posted: Sun Jul 06, 2008 11:53 pm
by ray.wurlod
The total time should not be an issue using cascaded Join stages, because pipeline parallelism keeps the data flowing. Join stages are fairly good at throughput, because they rely on the fact that both left and right inputs are identically partitioned and sorted so, for example, need only one row at a time from the left input to perform inner and left outer joins.
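
As a rough illustration of why identically sorted inputs allow a join to stream rows, here is a minimal merge-join sketch in Python. It is not the DataStage implementation; the function name and sample data are hypothetical, and it assumes both inputs are sorted on the key with unique keys.

```python
def merge_full_outer(left, right):
    """Full outer join of two iterables of (key, value) pairs.

    Assumption: both inputs are sorted by key and keys are unique per input.
    Yields (key, left_value, right_value), with None for a missing side.
    Only one current row per input is held in memory at any time.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            # Left key has no match on the right.
            yield (l[0], l[1], None)
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            # Right key has no match on the left.
            yield (r[0], None, r[1])
            r = next(right_it, None)
        else:
            # Keys are equal: emit the matched row and advance both sides.
            yield (l[0], l[1], r[1])
            l = next(left_it, None)
            r = next(right_it, None)


rows = list(merge_full_outer([(1, "a"), (3, "c")],
                             [(2, "x"), (3, "y"), (4, "z")]))
```

Because the function is a generator, its output can feed the next join as soon as each row is produced, which is the same idea as pipelining cascaded joins.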

Impossible to say time will be "reduced" or "increased" without something against which to compare. But a job with five Join stages should not take 25% more time than a job with four Join stages (the figure one would expect arithmetically). It should do better than that.

Beware, however, that operator combination is extremely eager to combine as many operators as possible, which might prove counter-productive in your case. It might be a good plan to make the third Join stage of five non-combinable (on its stage Advanced properties tab).
Otherwise one process might become overwhelmed at trying to effect all five joins.

Posted: Mon Jul 07, 2008 1:05 am
by keshav0307
Sorting the data will take most of the time. So if the partition and join keys are the same for all files, there should not be a major difference in total time.

Posted: Mon Jul 07, 2008 1:32 am
by ray.wurlod
keshav0307 wrote:sorting of the data will take most of the time. So if the partition and join key are same for all files then there should be major difference in total time.
I think you left out "not".