Explicit Sort vs Link Sort

Havoc · Post by **Havoc** » Fri Sep 28, 2007 12:41 am

There is a job which does a left outer join (using Join Stage) between two links :

Link1 - 30 million rows
Link2 - 10 million rows (Right Link)

When i use the explicit link sort (hash,sort) on both these links on my join keys , the job aborts after running for some time with this fatal error:

buffer(20),1: APT_BufferOperator: Add block to queue failed. This means that your buffer filesystems all ran out of file space, or that some other system error occurred. Please ensure that you have sufficient scratchdisks in either the default or "buffer" pools on all nodes in your configuration file.

But when I place an explicit Sort Stage on both the input links to the join stage, the job runs successfully to completion.

What exactly happens during a link sort that differs from a Sort Stage? Can someone please throw some light on the process.

Thanks in advance

ray.wurlod · Post by **ray.wurlod** » Fri Sep 28, 2007 1:54 am

Sort stage is more flexible about using memory - you even have the ability to specify the per-node memory usage that the stage takes. Also, with an explicit Sort stage, it may be that your job no longer needs as much information in the inserted buffer operator (see the score) at any one time, so that there are fewer issues about adding more space to the buffer.

There are some environment variables that you can use to tune the buffering, but I believe an explicit Sort stage gives the best of all possible worlds, so advocate using it whenever sorting is required.

Havoc · Post by **Havoc** » Mon Oct 01, 2007 6:45 am

ray.wurlod wrote:Sort stage is more flexible about using memory - you even have the ability to specify the per-node memory usage that the stage takes. Also, with an explicit Sort stage, it may be that your job no lon ...

Does using an explicit sort stage yield better performance as compared to a Link Sort? How does the memory usage between the two vary?

ray.wurlod · Post by **ray.wurlod** » Mon Oct 01, 2007 2:39 pm

It depends. You can tune memory consumption in the Sort stage explicitly for that stage. And "performance" needs to be defined.

Havoc · Post by **Havoc** » Mon Oct 01, 2007 2:44 pm

ray.wurlod wrote:It depends. You can tune memory consumption in the Sort stage explicitly for that stage. And "performance" needs to be defined. ...

Thanks for the reply Ray..

Performance can be defined as :

1) Better throughput from the join stage or improved throughput from upstream stages/operators

2) Job not aborting as the number of records for the join stage increase (lets say 10 million rows/week)

One more question.. how much of an impact does placing/not placing a Sort Stage have on Link Buffering?