Sort Stage

asitagrawal · Post by **asitagrawal** » Wed Nov 01, 2006 12:23 pm

Hi,

I am using Sort Stage in my job.
Its output is an input to an Aggregator stage.

The Help of Sort Stage specifies:
1) Max Rows in Virtual Memory: the maximum number of rows (from 2 to 50,000)

2) Max Open Files = 20.

So in all I can sort just 50,000 * 20 = 1,000,000 rows.

But I have more than this, say 2 million rows, which will first be sorted and then aggregated. So how do I achieve it here?

ray.wurlod · Post by **ray.wurlod** » Wed Nov 01, 2006 12:44 pm

All that's fine. You just end up making more use of scratch disk, which will slow execution.

If you have an external sort utility, or your own sort routine, you might invoke that instead. For example, the sort command in MKS Toolkit (a UNIX-on-Windows utility) is more efficient than the server Sort stage for sorting data.
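To illustrate why the two limits do not simply multiply into a hard row cap, here is a minimal Python sketch of a generic external merge sort. It is not DataStage's actual implementation (the thread does not establish which algorithm the server Sort stage uses); the function names and parameters are made up for the example, and rows are assumed to be strings without embedded newlines. When there are more sorted runs than files allowed open at once, it merges them in extra passes through scratch disk, which costs time rather than failing.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_run(sorted_rows, scratch_dir):
    """Write already-sorted rows to a new scratch file and return its path."""
    fd, path = tempfile.mkstemp(dir=scratch_dir, suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(row + "\n" for row in sorted_rows)
    return path


def read_run(path):
    """Yield the rows of one run file, one at a time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def merge_runs(paths):
    """Lazily merge several sorted run files into one sorted stream."""
    return heapq.merge(*(read_run(p) for p in paths))


def external_sort(rows, max_rows_in_memory, max_open_files, scratch_dir):
    """Yield rows in sorted order with bounded memory and open input files.

    Hypothetical stand-ins for the stage settings: max_rows_in_memory plays
    the role of "Max Rows in Virtual Memory", max_open_files of "Max Open Files".
    """
    if max_rows_in_memory < 1 or max_open_files < 2:
        raise ValueError("need at least 1 row in memory and 2 open files")
    rows = iter(rows)
    runs = []
    try:
        # Phase 1: sort memory-sized chunks and spill each one to scratch disk.
        while True:
            chunk = sorted(islice(rows, max_rows_in_memory))
            if not chunk:
                break
            runs.append(write_run(chunk, scratch_dir))
        # Phase 2: while there are too many runs to open at once, merge a
        # batch of max_open_files runs into a single longer run on disk.
        while len(runs) > max_open_files:
            group, runs = runs[:max_open_files], runs[max_open_files:]
            runs.append(write_run(merge_runs(group), scratch_dir))
            for path in group:
                os.remove(path)
        # Phase 3: few enough runs remain for one final streaming merge.
        yield from merge_runs(runs)
    finally:
        # Clean up whatever scratch files are left.
        for path in runs:
            if os.path.exists(path):
                os.remove(path)
```

Under these assumptions, 2 million rows with 50,000 rows per run give 40 runs; with 20 files open at a time the sketch needs extra merge passes before the final one, so the job gets slower but still completes as long as scratch space holds out.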

asitagrawal · Post by **asitagrawal** » Wed Nov 01, 2006 12:47 pm

ray.wurlod wrote:All that's fine. You just end up making more use of scratch disk, which will slow execution.

If you have an external sort utility, or your own sort routine, you might invoke that instead. For exa ...

I cannot read ur reply as it is marked as 'Premium Content'.

tagnihotri · Post by **tagnihotri** » Tue Nov 07, 2006 2:35 am

The Sort stage does work fine with 2 million or more records (I have tested this). For better performance, remember to set the stable sort option to false. But you may get errors or timeouts if the record length or the number of keys is large. In that case you can try increasing the I/O Max buffer (by setting the env variable).

Kirtikumar · Post by **Kirtikumar** » Tue Nov 07, 2006 4:47 am

The Sort stage in server jobs has performance and other issues when it is applied to a large number of rows. You can find all of them by searching this forum for the server Sort stage.

In a Unix environment, instead of using the Sort stage it is normally better to use the Unix sort utility (the sort command), which works faster than the DataStage server Sort stage.

I am not sure whether a similar sort exists on Windows. If it does, you need to check its performance before using it instead of the Sort stage.

There are many third-party sorts, like tsort, that you can get to do the sorting faster. You can invoke such a sort command through the command filter in the Sequential File stage. Search the forum and you will find many posts on this.

If the row count is going to stay constant then the Sort stage is OK, but if it grows to billions then it is better to consider a sort utility.
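As a rough illustration of what such a sort command might look like, here is a small Python sketch that builds a Unix sort invocation for a delimited file. The file names, the key column and the wrapper functions are hypothetical; -t, -k and -o are standard sort options, while -T (scratch directory) is assumed to be available as it is in GNU sort. In a real job the resulting command line would go into the stage's filter rather than being run from Python.

```python
import os
import subprocess


def build_sort_command(src, dst, key_field, delimiter=",", scratch_dir=None):
    """Return the argument list for sorting src on one delimited field."""
    cmd = [
        "sort",
        "-t", delimiter,                    # field separator
        "-k", f"{key_field},{key_field}",   # sort on this field only
        "-o", dst,                          # write the result to dst
    ]
    if scratch_dir is not None:
        cmd += ["-T", scratch_dir]          # where sort puts its temp files
    cmd.append(src)
    return cmd


def run_sort(cmd):
    """Run the command with a byte-wise collation order (LC_ALL=C)."""
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(cmd, check=True, env=env)


# Hypothetical file names, for illustration only.
print(" ".join(build_sort_command("input.csv", "sorted.csv", 1)))
```

Whether this beats the Sort stage on a given machine still needs to be measured, as noted above.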

tagnihotri · Post by **tagnihotri** » Tue Nov 07, 2006 9:08 am

UNIX sort, hmm, now that's again an open topic! We can even opt for UNIX sort in the Sort stage, or go for a before/after job routine!

I do agree with you when you say the Sort stage has issues with very high volumes of data, but there are reasons and solutions for that.

Also, a question for the experts out here: which sort algorithm does DS use internally?

meena · Post by **meena** » Tue Nov 07, 2006 9:51 am

Hi asitagrawal,
Ray is talking about using an external sort utility, for example the sort command in MKS Toolkit (a UNIX-on-Windows utility).


For tagnihotri,
I have never heard of an I/O max buffer setting for the Sort stage in a server job. Are you talking about the parallel Sort stage? Correct me if I am wrong. Also, we have the option of calling a before/after routine in the Sort stage (I mean the Sort stage in a server job).


ray.wurlod · Post by **ray.wurlod** » Tue Nov 07, 2006 5:01 pm

asitagrawal wrote:I cannot read ur reply as it is marked as 'Premium Content'.

For less than $1 per week you can read premium posts and bask in the contentment that you're helping to sustain this site.

tagnihotri · Post by **tagnihotri** » Wed Nov 08, 2006 12:25 am

Apologies, the env variable I specified is for PX!! Hmm, I am not sure whether it's age or work getting to me.

But I still hold that the server Sort stage performs well enough to handle huge data. The only aborts I have seen were caused by the temp space being full, for which we can explicitly specify the working area.

Secondly, from what I understand, you have placed an Aggregator after the Sort stage, and I believe the sort keys must be the same as the aggregation keys. So better explicitly set the 'no sort' or 'ignore' option in the Aggregator stage! From what I have seen, the Aggregator stage otherwise goes for a re-sort, or a check of whether the data is sorted, which consumes a lot of CPU time!
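To show why pre-sorted input makes a second sort unnecessary, here is a minimal Python sketch of aggregating rows that already arrive sorted on the grouping key. It is only an analogy for the idea, not how the Aggregator stage is implemented; the function name, the row layout and the sample data are made up for the example. Because equal keys are adjacent, each group can be finished and emitted as soon as the key changes, with only one group held in memory.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows, key_index, value_index):
    """Yield (key, row_count, value_sum) per group.

    Assumes rows are already sorted on the key column; unsorted input
    would produce the same key more than once.
    """
    for key, group in groupby(rows, key=itemgetter(key_index)):
        count = 0
        total = 0
        for row in group:
            count += 1
            total += row[value_index]
        yield key, count, total


# Sample rows, already sorted on the first column.
rows = [("A", 10), ("A", 5), ("B", 7), ("C", 1), ("C", 2)]
print(list(aggregate_sorted(rows, 0, 1)))
# prints [('A', 2, 15), ('B', 1, 7), ('C', 2, 3)]
```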
