Join Performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Join Performance

Post by pravin1581 »

Hi All,

I would like to know does changing the partttion type from Auto to something else improves the performance of Join.In our case it is a join between a table which 95 million records and another link from a join stage which holds 4500 record. In both the join stages Auto partition has been used, the second join i.e. between table(95 mil) and the link from the first join takes almost an hour to complete even though the output is 1200 records.The join type is Left Outer Join with the larger table being the Right link in the order.

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Probably not. Auto will allocate Hash as the partitioning algorithm on the inputs to a Join stage - you can verify this by inspection of the job score.

The amount of time taken by the first (or, indeed, any) Join stage is not a factor of the number of output rows - it is a factor of the number of input rows. Remember, too, that the inputs must be sorted on the join keys - it is best to emplace a specific Sort stage for this (not least because you can then use the sort mode of "don't sort (previously sorted)" if applicable).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

ray.wurlod wrote:Probably not. Auto will allocate Hash as the partitioning algorithm on the inputs to a Join stage - you can verify this by inspection of the job score.

The amount of time taken by the first (or, indeed, any) Join stage is not a factor of the number of output rows - it is a factor of the number of input rows. Remember, too, that the inputs must be sorted on the join keys - it is best to emplace a specific Sort stage for this (not least because you can then use the sort mode of "don't sort (previously sorted)" if applicable).
Thanks for the reply, we have included Sort stage after the table with hash partitioninng on the Join keys and in the Join stage made the partition type to Same. Is it necessary to include Sort for the other link as well, now it is based on Auto partition.
vijay.rajendran
Participant
Posts: 6
Joined: Thu Sep 29, 2005 8:39 pm
Location: Sydney, Australia

Post by vijay.rajendran »

Have you thought of using lookup instead of join? the lookup dataset (4500) will be held in the memory and the 95mil need not be sorted. Just a thought.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you do not specify sorting where sorting is required, DataStage will insert a tsort operator anyway, with default characteristics. This can be seen in the job score. You will probably end up with a sub-optimal solution.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

ray.wurlod wrote:If you do not specify sorting where sorting is required, DataStage will insert a tsort operator anyway, with default characteristics. This can be seen in the job score. You will probably end up with a sub-optimal solution.
Even after including a specific sort stage afte the table having 95 million records , the join performance didn't improve it remained the same.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Then my original response ("probably not") remains.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

ray.wurlod wrote:Then my original response ("probably not") remains.
We have resorted to Hash partitioning for the link in which sort stage has been inserted the one with the greater volume of data and the other link has Auto partitioning without the sort stage, the one with the smaller volume of data.
Post Reply