Performance of job vs number of join stages

verify · Post by **verify** » Thu Jul 03, 2008 5:37 am

Hi,

What happens to the performance of a job as the number of Join stages (full outer join) increases?

Thanks.

mahadev.v · Post by **mahadev.v** » Thu Jul 03, 2008 5:46 am

Performance reduces.

DSDexter · Post by **DSDexter** » Thu Jul 03, 2008 6:44 am

mahadev.v wrote:Performance reduces.

I don't think so.

You are adding more Join stages because they implement functionality you need (I'll assume this). So if you want your job to do more work, then the time consumed will be more, but that is not the same as worse performance. If you want a performance improvement, try revamping the logic.

ray.wurlod · Post by **ray.wurlod** » Thu Jul 03, 2008 4:06 pm

Without understanding your definition of "performance" in an ETL context your question is impossible to answer.

keshav0307 · Post by **keshav0307** » Thu Jul 03, 2008 7:24 pm

You may see a performance hit and may also run short of sort space.

verify · Post by **verify** » Sun Jul 06, 2008 11:44 pm

Hi All,

Thanks for your replies. Let me explain the issue clearly. We have a job with six sequential files as input and a single Join stage. Initially we used a left outer join. The requirement now is to use a full outer join to get the unmatched data. As the full outer join supports only two inputs, we need five Join stages for this job. Does the time taken by the job increase as we increase the number of Join stages? If so, is there an alternative solution?

Thanks.
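The arrangement described in the post above (six inputs, two inputs per full outer join) can be sketched in plain Python to show why five joins are needed. This is only an illustration, not DataStage code: the dictionaries, the `full_outer_join` and `cascade` helpers, and the sample values are all hypothetical, and each input is assumed to have unique keys.

```python
from typing import Dict, List, Optional, Tuple

Joined = Dict[int, List[Optional[str]]]


def full_outer_join(acc: Joined, nxt: Dict[int, str], width: int) -> Joined:
    """Two-input full outer join: keep every key from either side.

    `width` is the number of columns already in `acc`, so keys that appear
    only in `nxt` are padded with that many None values on the left.
    """
    return {
        key: acc.get(key, [None] * width) + [nxt.get(key)]
        for key in sorted(acc.keys() | nxt.keys())
    }


def cascade(inputs: List[Dict[int, str]]) -> Tuple[Joined, int]:
    """Chain two-input joins over all inputs and count the joins used."""
    acc: Joined = {key: [value] for key, value in inputs[0].items()}
    stages = 0
    for width, nxt in enumerate(inputs[1:], start=1):
        acc = full_outer_join(acc, nxt, width)
        stages += 1
    return acc, stages


# Six hypothetical inputs standing in for the six sequential files.
inputs = [
    {1: "a1", 2: "a2"},
    {2: "b2"},
    {3: "c3"},
    {1: "d1"},
    {2: "e2", 3: "e3"},
    {4: "f4"},
]
result, stages = cascade(inputs)
print(stages)
for key, row in result.items():
    print(key, row)
```

With N inputs, a chain of two-input joins always needs N - 1 joins, which is where the five Join stages come from.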

ray.wurlod · Post by **ray.wurlod** » Sun Jul 06, 2008 11:53 pm

The total time should not be an issue using cascaded Join stages, because pipeline parallelism keeps the data flowing. Join stages are fairly good at throughput, because they rely on the fact that both left and right inputs are identically partitioned and sorted so, for example, need only one row at a time from the left input to perform inner and left outer joins.
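To illustrate why sorted inputs let a join stream its data, here is a minimal Python sketch of a full outer merge join. It is not how DataStage implements the Join stage; the function name, the `(key, value)` row shape, and the sample rows are assumptions, and keys are assumed to be unique and sorted ascending on both sides.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (join key, payload)


def full_outer_merge_join(
    left: Iterable[Row], right: Iterable[Row]
) -> Iterator[Tuple[int, Optional[str], Optional[str]]]:
    """Full outer join of two inputs sorted ascending on a unique key.

    Only the current row from each side is held in memory at any time.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            # Left key has no match on the right.
            yield (l[0], l[1], None)
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            # Right key has no match on the left.
            yield (r[0], None, r[1])
            r = next(right_it, None)
        else:
            # Keys match: emit one combined row and advance both sides.
            yield (l[0], l[1], r[1])
            l = next(left_it, None)
            r = next(right_it, None)


left_rows = [(1, "a"), (2, "b"), (4, "d")]
right_rows = [(2, "x"), (3, "y"), (4, "z")]
joined = list(full_outer_merge_join(left_rows, right_rows))
print(joined)
# [(1, 'a', None), (2, 'b', 'x'), (3, None, 'y'), (4, 'd', 'z')]
```

Because the function is a generator, its output can feed another join of the same kind without materialising the intermediate result, which is loosely analogous to rows flowing through cascaded stages.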

Impossible to say time will be "reduced" or "increased" without something against which to compare. But a job with five Join stages should not take 25% more time than a job with four Join stages (the figure one would expect arithmetically). It should do better than that.

Beware, however, that operator combination is extremely eager to combine as many operators as possible, which might prove counter-productive in your case. It might be a good plan to make the third Join stage of five non-combinable (on its stage Advanced properties tab). Otherwise one process might become overwhelmed at trying to effect all five joins.

keshav0307 · Post by **keshav0307** » Mon Jul 07, 2008 1:05 am

Sorting the data will take most of the time. So if the partitioning and join keys are the same for all files, then there should not be a major difference in total time.

ray.wurlod · Post by **ray.wurlod** » Mon Jul 07, 2008 1:32 am

keshav0307 wrote:sorting of the data will take most of the time. So if the partition and join key are same for all files then there should be major difference in total time.

I think you left out "not".