Difference between Sort stage and inline Sort

benny.lbs · Post by **benny.lbs** » Tue Oct 11, 2005 2:49 am

Could anyone explain the difference between the Sort stage and inline sort? Does either have a performance issue?

Thanks a lot !

ray.wurlod · Post by **ray.wurlod** » Tue Oct 11, 2005 3:03 am

They both sort data if configured correctly. "Performance issue" is, more than anything, a matter of expectations. To determine which completes more quickly you can create jobs that use both methods, but remember to be fair - allow for cache effects (ideally re-boot server between tests), and sort many different sets of data with different characteristics. Post your results here, if you would be so kind.
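As a rough illustration of what a "fair" comparison could look like, here is a minimal Python sketch of a timing harness. It does not run DataStage jobs: `make_datasets`, `time_sort`, `compare` and the two stand-in methods are all hypothetical names, and the in-process sorts merely take the place of the two jobs you would actually time. It shows only the shape of the test: several inputs with different characteristics, repeated runs, and a median rather than a single measurement. It cannot flush operating system caches, so that part still has to be handled outside the script.

```python
import random
import statistics
import time


def make_datasets(n, seed=0):
    """Build inputs with different characteristics (hypothetical test data)."""
    rng = random.Random(seed)
    random_keys = [rng.randrange(n) for _ in range(n)]
    return {
        "random": random_keys,
        "already_sorted": sorted(random_keys),
        "reversed": sorted(random_keys, reverse=True),
        "few_distinct_keys": [rng.randrange(10) for _ in range(n)],
    }


def time_sort(sort_fn, data, repeats=5):
    """Return the median wall-clock time of sort_fn over several runs."""
    timings = []
    for _ in range(repeats):
        copy = list(data)  # fresh copy so every run sees the same input
        start = time.perf_counter()
        sort_fn(copy)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare(methods, n=10_000, repeats=5):
    """Time every method on every dataset: {dataset: {method: seconds}}."""
    results = {}
    for name, data in make_datasets(n).items():
        results[name] = {
            label: time_sort(fn, data, repeats) for label, fn in methods.items()
        }
    return results


if __name__ == "__main__":
    # Stand-ins for the two approaches being compared (assumption: in a real
    # test each entry would launch and time one of the two jobs instead).
    methods = {"method_a": sorted, "method_b": lambda rows: rows.sort()}
    for dataset, timings in compare(methods).items():
        print(dataset, {k: f"{v:.4f}s" for k, v in timings.items()})
```

Reporting the timings per dataset, rather than one overall figure, makes it easier to see whether one method only wins on, say, nearly sorted input.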

kumar_s · Post by **kumar_s** » Tue Oct 11, 2005 6:45 am

Hi Ray,
It's a known fact that an explicit sort is more efficient than an implicit sort. But I really don't understand how the DataStage sort utility can be more efficient than the UNIX one. In the case of a dataset, OK, the DataStage utility is the only way. But in the case of a sequential file, I was expecting UNIX to be faster.

Is it something related to hashing the key and processing....

Also, for counting the number of records: when I tested with some small files (a few GB), the UNIX wc command was far better than the Aggregator stage at counting the records.

regards
kumar

ray.wurlod · Post by **ray.wurlod** » Tue Oct 11, 2005 2:48 pm

Neither the Sort stage nor the implicit sort (by which I assume you mean sorting specified on the input link of a stage) goes out to the UNIX sort command - at least by default. If you want to bring UNIX sort into the mix, why not bring third-party sort utilities such as SyncSort or CoSort in as well? These survive solely by being faster than anything else.

DataStage has overheads that the UNIX sort command does not have, even if processing a single sequential file. Not least of these is the process overhead - conductor, section leader(s), players. On the other hand, the data stream is within DataStage, so you can go on and do other things with it.

track_star · Post by **track_star** » Tue Oct 11, 2005 3:22 pm

Boys and girls, the inline sort and sort stage are the same thing. Check the code....tsort=tsort. You have a few extra options in the sort stage that you don't get in the inline sort, but other than that, it's the same.

As for why tsort is faster than a plain UNIX sort, it sorts each partition (assuming you have a config file with multiple nodes defined). And unless you write an elaborate shell script, sort won't do that on its own. Back in PX 6.x, you could call SyncSort directly from the sort stage, but most shops couldn't afford licenses for both SyncSort and PX, so they took that functionality out at 7.0.
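To make the "sorts each partition" idea concrete, here is a minimal Python sketch. It is not how tsort is implemented; `hash_partition` and `partitioned_sort` are hypothetical names, and the choice of CRC32 as the partitioning hash is an assumption made only so the example is deterministic. Rows are hash-partitioned on the sort key, then each partition is sorted independently by its own worker.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq
import zlib


def hash_partition(rows, key, num_partitions):
    """Send every row to a partition chosen by a hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike the built-in hash() for strings
        index = zlib.crc32(str(key(row)).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def partitioned_sort(rows, key, num_partitions=4):
    """Sort each partition independently; returns a list of sorted partitions."""
    partitions = hash_partition(rows, key, num_partitions)
    # Threads only illustrate the structure here; CPython threads do not
    # speed up CPU-bound sorting. A real engine uses separate processes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(lambda part: sorted(part, key=key), partitions))


if __name__ == "__main__":
    rows = [("delta", 4), ("alpha", 1), ("charlie", 3), ("bravo", 2), ("alpha", 5)]
    parts = partitioned_sort(rows, key=lambda r: r[0], num_partitions=2)
    for number, part in enumerate(parts):
        print(number, part)
    # Each partition is sorted, but the partitions together are not one
    # total order; a merge step would be needed for that.
    print(list(heapq.merge(*parts, key=lambda r: r[0])))
```

In this sketch, rows with the same key always land in the same partition, which is what a downstream key-based step needs. Each worker also sorts only a fraction of the data, which is where the advantage over a single sequential sort would come from.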

kumar_s · Post by **kumar_s** » Sat Oct 22, 2005 6:29 am

Hi,
What is the underlying operation if we choose UNIX as the sort utility to sort a dataset? And what makes the DataStage sort differ from it?

regards
kumar