Explicit Sort vs Link Sort

Havoc
Participant
Posts: 110
Joined: Fri Nov 24, 2006 8:26 am

Explicit Sort vs Link Sort

Post by Havoc »

I have a job that does a left outer join (using the Join stage) between two links:

Link1 - 30 million rows
Link2 - 10 million rows (Right Link)

When I use a link sort (hash partitioning plus sort) on both of these links on my join keys, the job aborts after running for some time with this fatal error:

buffer(20),1: APT_BufferOperator: Add block to queue failed. This means that your buffer filesystems all ran out of file space, or that some other system error occurred. Please ensure that you have sufficient scratchdisks in either the default or "buffer" pools on all nodes in your configuration file.
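For reference, the scratch disks this message refers to are declared per node in the parallel configuration file (the file that APT_CONFIG_FILE points to). A minimal single-node sketch follows; the node name, host name, and all paths are hypothetical, and a real file would normally list several nodes:

```
{
  node "node1"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" {pools ""}
    /* scratch space in the default pool, used by sorts and buffering */
    resource scratchdisk "/scratch/default" {pools ""}
    /* optional scratch space reserved for buffer operators */
    resource scratchdisk "/scratch/buffer" {pools "buffer"}
  }
}
```

If the file systems behind these scratchdisk entries fill up while a buffer operator is spilling to disk, the job aborts with the error shown above.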

But when I place an explicit Sort Stage on both the input links to the join stage, the job runs successfully to completion.

What exactly happens during a link sort that differs from a Sort stage? Can someone please shed some light on the process?

Thanks in advance :)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sort stage is more flexible about using memory - you even have the ability to specify the per-node memory usage that the stage takes. Also, with an explicit Sort stage, it may be that your job no longer needs as much information in the inserted buffer operator (see the score) at any one time, so that there are fewer issues about adding more space to the buffer.

There are some environment variables that you can use to tune the buffering, but I believe an explicit Sort stage gives the best of all possible worlds, so advocate using it whenever sorting is required.
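The post does not name the variables. The ones usually meant for buffer tuning are the APT_BUFFER_* settings, shown here as a sketch in NAME=value form. The values are illustrative assumptions, not recommendations, and they can be set at project or job level:

```
# Memory each buffer operator may use before spilling to scratch disk, in bytes
# (the default is 3145728, i.e. 3 MB; this example doubles it)
APT_BUFFER_MAXIMUM_MEMORY_USAGE=6291456

# Fraction of that memory a buffer fills before it starts pushing back on the
# upstream operator (0.5 is the default)
APT_BUFFER_FREE_RUN=0.5

# Size of each block written to scratch disk, in bytes (1048576 is the default)
APT_BUFFER_DISK_WRITE_INCREMENT=1048576
```

Raising the memory limit trades RAM per buffer operator for less scratch disk usage, so it needs to be weighed against the number of buffer operators shown in the score.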
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Havoc

Post by Havoc »

ray.wurlod wrote:Sort stage is more flexible about using memory - you even have the ability to specify the per-node memory usage that the stage takes. Also, with an explicit Sort stage, it may be that your job no lon ...
Does using an explicit sort stage yield better performance as compared to a Link Sort? How does the memory usage between the two vary?
ray.wurlod

Post by ray.wurlod »

It depends. You can tune memory consumption in the Sort stage explicitly for that stage. And "performance" needs to be defined.
Havoc

Post by Havoc »

ray.wurlod wrote:It depends. You can tune memory consumption in the Sort stage explicitly for that stage. And "performance" needs to be defined. ...
Thanks for the reply, Ray.

Performance can be defined as:

1) Better throughput from the Join stage, or improved throughput from upstream stages/operators

2) The job not aborting as the number of records reaching the Join stage increases (let's say by 10 million rows per week)

One more question: how much of an impact does placing or not placing a Sort stage have on link buffering?