Sort Stage

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Sort Stage

Post by asitagrawal »

Hi,

I am using Sort Stage in my job.
Its output is an input to an Aggregator stage.

The Sort stage help specifies:
1) Max Rows in Virtual Memory: the maximum number of rows (from 2 to 50,000).

2) Max Open Files = 20.

So in all I can sort just 50,000 * 20 = 1,000,000 rows.

But I have more than this, say 2 million rows, which will first be sorted and then aggregated. So how do I achieve that here?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

All that's fine. You just end up making more use of scratch disk, which will slow execution.

If you have an external sort utility, or your own sort routine, you might invoke that instead. For example, the sort command in MKS Toolkit (a UNIX-on-Windows utility) is more efficient than the server Sort stage for sorting data.
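To illustrate why the memory setting does not cap the total row count, here is a minimal sketch of a generic external merge sort in Python. It is not DataStage's actual algorithm, which this thread does not document. The function name `external_sort` and its parameters are hypothetical. Sorted chunks are spilled to temporary files (the "scratch disk") and then merged, so larger inputs cost more disk I/O rather than failing. For simplicity it merges all runs in one pass; an implementation with a limit on open files would merge in several passes.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(rows, max_rows_in_memory=50_000, key=None):
    """Yield text rows in sorted order, spilling sorted runs to temp files.

    Hypothetical helper for illustration only. Rows must be strings that
    contain no newline characters.
    """
    run_paths = []
    iterator = iter(rows)
    try:
        # Phase 1: read at most max_rows_in_memory rows, sort them in
        # memory, and write each sorted chunk ("run") to scratch disk.
        while True:
            chunk = list(islice(iterator, max_rows_in_memory))
            if not chunk:
                break
            chunk.sort(key=key)
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(line + "\n" for line in chunk)
            run_paths.append(path)

        # Phase 2: merge all runs; only one row per run is held in memory.
        files = [open(path) for path in run_paths]
        try:
            streams = [(line.rstrip("\n") for line in f) for f in files]
            yield from heapq.merge(*streams, key=key)
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the scratch files whether or not the merge completed.
        for path in run_paths:
            os.remove(path)
```

With `max_rows_in_memory=50_000`, a 2-million-row input would produce 40 runs in this sketch. That means more scratch files and more disk traffic, which matches the slowdown described above.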
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

ray.wurlod wrote:All that's fine. You just end up making more use of scratch disk, which will slow execution.

If you have an external sort utility, or your own sort routine, you might invoke that instead. For exa ...
I cannot read your reply, as it is marked as 'Premium Content'.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Re: Sort Stage

Post by tagnihotri »

The Sort stage does work fine with 2 million or more records (I have tested this). For better performance, remember to set the stable sort option to false. But you may get errors or timeouts if the record length or the number of keys is large. For that, you can try increasing the I/O max buffer (by setting the environment variable).


Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

The Sort stage in server jobs has performance and other issues when it is applied to a large number of rows. You can find all of them by searching this forum for the server Sort stage.

In a Unix environment, instead of using the Sort stage, it is normally better to use the Unix sort utility (that is, the sort command), which works faster than the DataStage server Sort stage.

I am not sure whether a similar sort exists on Windows. If it does, you need to check its performance before using it instead of the Sort stage.

There are many third-party sorts, such as tsort, that you can get to do the sorting faster. You can invoke such a sort command through the command filter in the Sequential File stage. Search the forum and you will find many posts on this.
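A filter command of this kind reads rows on standard input and writes sorted rows to standard output. The sketch below mimics that hand-off from Python. It assumes a POSIX `sort` command is on the PATH and that the data is comma-delimited; the function name `sort_with_external_command` is hypothetical. It shows only the piping, not how the filter is configured inside DataStage.

```python
import os
import subprocess


def sort_with_external_command(lines, delimiter=",", key_field=1):
    """Pipe rows through the system `sort` command and return sorted rows.

    Hypothetical helper: assumes a POSIX `sort` is available on the PATH.
    """
    # -t sets the field delimiter; -k N,N sorts on field N only.
    command = ["sort", "-t", delimiter, "-k", f"{key_field},{key_field}"]
    # LC_ALL=C forces plain byte ordering so results do not vary by locale.
    env = dict(os.environ, LC_ALL="C")
    completed = subprocess.run(
        command,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if sort exits non-zero
        env=env,
    )
    return completed.stdout.splitlines()
```

The equivalent filter would be a single command line along the lines of `sort -t, -k1,1`, with the delimiter and key field adjusted to the file layout.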

If the row count is going to stay constant then the Sort stage is OK, but if it grows to billions then it is better to consider a sort utility.
Regards,
S. Kirtikumar.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

Unix sort, hmm, now that's again an open topic! We can even opt for Unix sort in the Sort stage, or go for a before/after job routine!

I do agree with you that sort gives issues with very high volumes of data, but there are reasons and solutions for that.

Also, a question for the experts out here: which sort algorithm does DS use internally?

meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Post by meena »

Hi asitagrawal,
Ray is talking about an external sort utility, specifically the sort command in MKS Toolkit (a UNIX-on-Windows utility).
For tagnihotri:
I have never heard of an I/O max buffer setting for the Sort stage in a server job. Are you talking about the parallel Sort stage? Correct me if I am wrong. Also, we have the option of calling a before/after routine in the Sort stage (I mean the Sort stage in a server job).
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

asitagrawal wrote:I cannot read ur reply as it is marked as 'Premium Content'.
For less than $1 per week you can read premium posts and bask in the contentment that you're helping to sustain this site.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

Apologies, the environment variable I specified is for PX! Hmm, I am not sure whether it's age or work getting to me :roll:

But I still hold that the server Sort stage performs well enough to handle huge data volumes. The only aborts I have seen were caused by the temp space filling up, and for that we can explicitly specify the working area.

Secondly, from what I understand, you have placed an Aggregator after the Sort stage, and I believe the sort keys must be the same as the aggregation keys. So it is better to explicitly set the 'no sort' or 'ignore' option in the Aggregator stage! From what I have seen, the Aggregator stage otherwise re-sorts the data or checks whether it is sorted, which consumes a lot of CPU time.
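To show why matching sort keys and aggregation keys matters, here is a minimal Python sketch of aggregating input that is already sorted by key. It is an illustration only and does not describe how the Aggregator stage is implemented; the function name `aggregate_sorted` and the (key, amount) row layout are assumptions. Because equal keys arrive together, each group can be totalled and emitted as soon as the key changes, with no re-sort and no table of all keys held in memory.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows):
    """Sum amounts per key for (key, amount) rows already sorted by key.

    Hypothetical helper: it relies on the input order. If the rows are not
    sorted by key, the same key is emitted more than once.
    """
    # groupby only groups adjacent rows, which is exactly what sorted
    # input guarantees for equal keys.
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)
```

For example, `list(aggregate_sorted([("a", 1), ("a", 2), ("b", 5)]))` gives `[("a", 3), ("b", 5)]`.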