Difference between Sort stage and inline Sort

benny.lbs
Participant
Posts: 125
Joined: Wed Feb 23, 2005 3:46 am

Post by benny.lbs »

Could anyone explain the difference between the Sort stage and an inline sort? Is there any performance difference?

Thanks a lot !
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

They both sort data if configured correctly. "Performance issue" is, more than anything, a matter of expectations. To determine which completes more quickly you can create jobs that use both methods, but remember to be fair: allow for cache effects (ideally reboot the server between tests), and sort many different sets of data with different characteristics. Post your results here, if you would be so kind.
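
The kind of fair comparison described above might be organised roughly as follows. This is only a sketch in Python, not a DataStage test: the two "methods" are stand-in functions, and the dataset names, sizes, and helper names (`make_datasets`, `time_sort`, `compare`) are all assumptions made for illustration. It cannot control operating system cache effects, so those still need handling outside the script.

```python
import random
import statistics
import time


def make_datasets(n=20_000, seed=42):
    """Build inputs with different characteristics (all hypothetical)."""
    rng = random.Random(seed)
    base = [rng.randrange(1_000_000) for _ in range(n)]
    return {
        "random": base,
        "already_sorted": sorted(base),
        "reversed": sorted(base, reverse=True),
        "few_distinct_keys": [rng.randrange(10) for _ in range(n)],
    }


def time_sort(sort_fn, data, repeats=5):
    """Return the median wall-clock time of sort_fn over fresh copies of data."""
    timings = []
    for _ in range(repeats):
        work = list(data)  # copy so every run sees the same input order
        start = time.perf_counter()
        result = sort_fn(work)
        timings.append(time.perf_counter() - start)
        assert result == sorted(data)  # a fast but wrong sort does not count
    return statistics.median(timings)


def compare(methods, datasets, repeats=5):
    """Time every method on every dataset: {dataset: {method: seconds}}."""
    return {
        name: {label: time_sort(fn, data, repeats) for label, fn in methods.items()}
        for name, data in datasets.items()
    }


if __name__ == "__main__":
    methods = {
        "method_a": sorted,  # stand-in for one job design
        "method_b": lambda rows: sorted(rows, key=lambda r: r),  # stand-in for the other
    }
    for dataset, row in compare(methods, make_datasets()).items():
        print(dataset, {label: f"{secs:.4f}s" for label, secs in row.items()})
```

Taking the median over several repeats, and reporting each dataset separately, keeps one lucky run or one friendly data distribution from deciding the result.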
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Ray,
It's a known fact that an explicit sort is more efficient than an implicit sort. But I really don't understand how the DataStage sort utility can be more efficient than the UNIX one. For a dataset, OK, the DataStage utility is the only way. But for a sequential file I was expecting UNIX to be faster. Is it something related to hashing the key and processing?

Also, for counting records: when I tested with some small files (a few GB), the UNIX wc command was far faster than the Aggregator stage at counting the number of records.
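
One plausible reason for the gap is how little work a line count needs: `wc -l` only counts newline bytes and never interprets fields. A minimal Python sketch of that style of counting is below; the function name `count_records` and the chunk size are assumptions for illustration, and it assumes one record per newline-terminated line.

```python
import os
import tempfile


def count_records(path, chunk_size=1 << 20):
    """Count newline bytes by scanning the file in binary chunks.

    Like `wc -l`, a final line without a trailing newline is not counted.
    """
    count = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


if __name__ == "__main__":
    # Write a small throwaway file and count its records.
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"1,alpha\n2,beta\n3,gamma\n")
    try:
        print(count_records(tmp.name))
    finally:
        os.remove(tmp.name)
```

Nothing here parses columns or builds records, which is work a general-purpose aggregation step would typically have to do before it can count anything.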

regards
kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Neither the Sort stage nor the implicit sort (by which I assume you mean sorting specified on the input link of a stage) goes out to the UNIX sort command - at least by default. If you want to bring UNIX sort into the mix, why not bring in third-party sort utilities such as SyncSort or CoSort as well? These survive solely by being faster than anything else.

DataStage has overheads that the UNIX sort command does not have, even if processing a single sequential file. Not least of these is the process overhead - conductor, section leader(s), players. On the other hand, the data stream is within DataStage, so you can go on and do other things with it.
track_star
Participant
Posts: 60
Joined: Sat Jan 24, 2004 12:52 pm
Location: Mount Carmel, IL

Post by track_star »

Boys and girls, the inline sort and the Sort stage are the same thing. Check the code: tsort = tsort. You have a few extra options in the Sort stage that you don't get in the inline sort, but other than that, it's the same.

As for why tsort is faster than a plain UNIX sort: it sorts each partition separately (assuming you have a config file with multiple nodes defined). And unless you write an elaborate shell script, sort won't do that on its own. Back in PX 6.x, you could call SyncSort directly from the Sort stage, but most shops couldn't afford licenses for both SyncSort and PX, so they took that functionality out at 7.0.
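
The per-partition idea can be sketched in Python, with the caveat that this is not how tsort is implemented: the function names, the CRC-based hash, and the use of threads are all assumptions for illustration. Rows are hash-partitioned on the key so equal keys land together, each partition is sorted on its own, and a merge step is needed only if a single globally ordered stream is wanted.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq
import zlib


def hash_partition(rows, key, partitions):
    """Send each row to a bucket chosen by a stable hash of its key."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        digest = zlib.crc32(str(key(row)).encode("utf-8"))
        buckets[digest % partitions].append(row)
    return buckets


def partitioned_sort(rows, key, partitions=4):
    """Sort every partition independently; the result is one sorted list per partition."""
    buckets = hash_partition(rows, key, partitions)
    # Threads keep the sketch simple. Real CPU parallelism in Python would
    # need separate processes, much as a parallel engine runs one sort per node.
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        return list(pool.map(lambda bucket: sorted(bucket, key=key), buckets))


def sort_merge_collect(sorted_buckets, key):
    """Merge already-sorted partitions into one globally ordered list."""
    return list(heapq.merge(*sorted_buckets, key=key))


if __name__ == "__main__":
    rows = [("b", 2), ("a", 1), ("c", 3), ("a", 0), ("b", 5)]
    parts = partitioned_sort(rows, key=lambda r: r[0], partitions=2)
    for number, part in enumerate(parts):
        print("partition", number, part)
    print("merged", sort_merge_collect(parts, key=lambda r: r[0]))
```

Each partition holds only a fraction of the rows, so every individual sort is smaller, and downstream stages that only need key groups kept together can consume the partitions without the final merge.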
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
What is the underlying operation if we choose UNIX as the sort utility to sort a dataset? And what makes the DataStage sort differ from it?

regards
kumar