Join Performance

pravin1581 · Post by **pravin1581** » Tue Dec 18, 2007 2:57 pm

Hi All,

I would like to know whether changing the partition type from Auto to something else improves the performance of a Join stage. In our case the join is between a table holding 95 million records and a link from another Join stage that holds 4,500 records. Auto partitioning is used in both Join stages. The second join, i.e. between the table (95 million records) and the link from the first join, takes almost an hour to complete even though the output is only 1,200 records. The join type is Left Outer Join, with the larger table as the Right link in the link ordering.

Thanks in advance.

ray.wurlod · Post by **ray.wurlod** » Tue Dec 18, 2007 4:07 pm

Probably not. Auto will allocate Hash as the partitioning algorithm on the inputs to a Join stage - you can verify this by inspection of the job score.
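To illustrate why hash partitioning on the join keys matters, here is a minimal Python sketch. It is not DataStage code: the function name `hash_partition`, the CRC32 hash, and the sample rows are all assumptions made for the example. The point it shows is that rows with the same key value always land in the same partition number on both inputs, so each partition can be joined independently.

```python
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Distribute rows across partitions by hashing the join key.

    Hypothetical helper for illustration; a real engine uses its own hash.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic, so equal keys map to the same partition.
        idx = crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[idx].append(row)
    return partitions


# Made-up sample data: a small input and a larger input sharing the key "id".
left = [{"id": i, "name": f"n{i}"} for i in range(10)]
right = [{"id": i % 10, "amount": i} for i in range(100)]

left_parts = hash_partition(left, "id", 4)
right_parts = hash_partition(right, "id", 4)
```

Because both inputs use the same hash on the same key, partition 0 of `left` only ever needs to be compared with partition 0 of `right`, and so on.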

The amount of time taken by the first (or, indeed, any) Join stage is not a factor of the number of output rows - it is a factor of the number of input rows. Remember, too, that the inputs must be sorted on the join keys - it is best to emplace a specific Sort stage for this (not least because you can then use the sort mode of "don't sort (previously sorted)" if applicable).
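As a rough illustration of why sorted inputs are required and why the cost follows the input volume, here is a minimal sort-merge left outer join in Python. It is a sketch under assumptions, not the Join stage's actual implementation: the function name `merge_left_outer_join` and the sample rows are invented for the example.

```python
from itertools import groupby
from operator import itemgetter


def merge_left_outer_join(left, right, key):
    """Left outer join of two inputs that are ALREADY sorted on `key`.

    Hypothetical helper: returns (left_row, right_row_or_None) pairs.
    """
    get = itemgetter(key)
    # Materialize each right-hand key group as it is read.
    right_groups = ((k, list(g)) for k, g in groupby(right, key=get))
    current = next(right_groups, None)
    out = []
    for l_key, l_rows in groupby(left, key=get):
        # Skip right-hand groups whose key is smaller than the left key.
        while current is not None and current[0] < l_key:
            current = next(right_groups, None)
        if current is not None and current[0] == l_key:
            matches = current[1]
        else:
            matches = [None]  # no match: keep the left row (outer join)
        for l_row in l_rows:
            for r_row in matches:
                out.append((l_row, r_row))
    return out


# Made-up sample data, both lists sorted on "id".
small = [{"id": 2, "tag": "b"}, {"id": 5, "tag": "e"}]
large = [{"id": i, "val": i * 10} for i in range(1, 5)]
result = merge_left_outer_join(small, large, "id")
```

Here `result` holds only two pairs, yet the loop still had to step through the rows of `large` to find them. Sorting and scanning the inputs is where the work goes, regardless of how few rows come out.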

pravin1581 · Post by **pravin1581** » Wed Dec 19, 2007 11:32 am

Thanks for the reply. We have included a Sort stage after the table, with hash partitioning on the join keys, and in the Join stage set the partition type to Same. Is it necessary to include a Sort stage for the other link as well? It currently uses Auto partitioning.

vijay.rajendran · Post by **vijay.rajendran** » Wed Dec 19, 2007 8:24 pm

Have you thought of using a Lookup instead of a Join? The lookup data set (4,500 records) will be held in memory, and the 95 million records need not be sorted. Just a thought.
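The idea can be sketched in plain Python, assuming the small input fits in memory. This is not DataStage code: the function name `lookup_join` and the sample rows are made up for illustration, and the sketch drops stream rows that find no match, which is only one of the possible behaviours.

```python
def lookup_join(stream_rows, reference_rows, key):
    """Join a large unsorted stream against a small in-memory reference.

    Hypothetical helper: unmatched stream rows are dropped in this sketch.
    """
    # Build an in-memory index over the small reference input.
    index = {}
    for ref in reference_rows:
        index.setdefault(ref[key], []).append(ref)
    # Stream the large input once, in whatever order it arrives.
    for row in stream_rows:
        for ref in index.get(row[key], []):
            yield {**row, **ref}


# Made-up sample data: the stream is deliberately unsorted.
reference = [{"id": 2, "tag": "b"}, {"id": 5, "tag": "e"}]
stream = ({"id": i, "val": i * 10} for i in [4, 2, 1, 3])
matched = list(lookup_join(stream, reference, "id"))
```

Only the small side is indexed; the large side is read row by row and never sorted, which is where the potential saving comes from.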

ray.wurlod · Post by **ray.wurlod** » Wed Dec 19, 2007 11:16 pm

If you do not specify sorting where sorting is required, DataStage will insert a tsort operator anyway, with default characteristics. This can be seen in the job score. You will probably end up with a sub-optimal solution.

pravin1581 · Post by **pravin1581** » Thu Dec 20, 2007 11:35 am

Even after including a specific Sort stage after the table with 95 million records, the join performance didn't improve; it remained the same.

ray.wurlod · Post by **ray.wurlod** » Thu Dec 20, 2007 3:14 pm

Then my original response ("probably not") remains.

pravin1581 · Post by **pravin1581** » Thu Dec 20, 2007 3:18 pm

We have resorted to Hash partitioning on the link with the greater volume of data, which is where the Sort stage has been inserted. The other link, the one with the smaller volume of data, uses Auto partitioning without a Sort stage.