Sort Stage, Stage Collector sort

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

I believe the first question has been answered in this thread already. I would add that the sort will only be inserted if sort insertion has not been disabled (via APT_NO_SORT_INSERTION, as documented).

If you mean better as in correct results, then yes. If your data is not sorted properly (and partitioned correctly if running in parallel), you will not receive the correct join results.
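
To see why, here is a conceptual sketch in ordinary Python (not DataStage code; the row layout and names are made up). A sort-merge style join only finds matches when rows with the same key land in the same partition and arrive in key order:

Code: Select all

# Conceptual sketch only -- not how the Join stage is implemented.
def hash_partition(rows, key, nparts):
    # Matching keys from both inputs must land in the same partition;
    # hashing on the join key guarantees that.
    parts = [[] for _ in range(nparts)]
    for row in rows:
        parts[hash(row[key]) % nparts].append(row)
    return parts

def merge_join(left, right, key):
    # Classic merge join: assumes both inputs are sorted on `key`
    # (and, for brevity, that keys are unique within each input).
    # Feed it unsorted data and matches are silently skipped --
    # wrong results rather than an error message.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out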

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
wblack
Premium Member
Posts: 30
Joined: Thu Sep 23, 2010 7:55 am

Post by wblack »

Is allowing the join to insert a tsort any better or worse than adding a Sort stage yourself and then setting the partitioning to Same in the Join stage?
William Black
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I prefer an explicit Sort stage, partly because of the law of least astonishment (don't try to astonish the next programmer) and partly because it gives me control of more things, such as memory used for sorting and generation of key change flags.
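
As an illustration of the key change flag (a conceptual sketch in plain Python, not DataStage internals; the column name keyChange is only illustrative): it is simply a 1 on the first row of each key group in sorted data and a 0 on every other row.

Code: Select all

# Conceptual sketch; column and key names are illustrative.
def add_key_change(sorted_rows, key):
    # Emit 1 on the first row of each key group, 0 on the rest.
    prev = object()  # sentinel that never equals a real key value
    for row in sorted_rows:
        row["keyChange"] = 1 if row[key] != prev else 0
        prev = row[key]
        yield row
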
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

wblack wrote:I have a rather elementary question. If a join expects sorted data and the data isn't sorted, it adds a tsort. Is this correct? Also, if a Sort stage is added before a join where the data isn't sorted, will this make the join perform better?
How would DataStage know if the data is sorted?
Phil Hibbs | Capgemini
Technical Consultant
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

PhilHibbs wrote:How would DataStage know if the data is sorted?
By looking at the upstream tsort operators arising from Sort stages, link sorts, and/or sorts previously inserted by DataStage itself.

That is also how DataStage can determine that a user-specified sort does not meet the requirements (another message you may see).

Beyond that, DataStage cannot detect data that is already sorted (e.g. by a SQL ORDER BY in a connector, or a dataset sorted in a different job). That is why it is sometimes desirable to insert a Sort stage and set it to "don't sort, previously sorted".
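
Conceptually, "don't sort, previously sorted" amounts to asserting the order you have declared instead of paying for a full re-sort, roughly like this plain-Python sketch (illustrative only, not the tsort implementation):

Code: Select all

# Conceptual sketch: verify a declared sort order rather than re-sort.
def assert_sorted(rows, key):
    prev = None
    for n, row in enumerate(rows):
        if prev is not None and row[key] < prev:
            raise ValueError("input not sorted on %r at row %d" % (key, n))
        prev = row[key]
        yield row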

Generate and review the job score in detail for every job under development.

Mike
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

That's correct!

DataStage will insert the sort anyway, and hence you will see the tsort operator in the job score; however, it checks whether the incoming data is already sorted and only sorts it if necessary.

APT_NO_SORT_INSERTION and APT_INSERT_SORT_CHECKONLY control this behaviour.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not quite. It's driven not by the data, but by the design. If there's nothing on the input link (a Sort stage or a link sort) indicating that the data are sorted, then a tsort operator will be added when the score is composed.

All three of these methods end up using a tsort operator, therefore "performance" is not an issue. However, the Sort stage gives you more options: in particular "don't sort (already sorted)", which can boost perceived performance, and the ability to allocate more memory than the default so that the sort is more likely to be performed entirely in memory.
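
The memory point is the usual in-memory versus spill-to-disk trade-off of any external sort. A generic sketch in plain Python (illustrative only, not the tsort operator; the memory limit is counted in rows for simplicity) shows why a bigger limit keeps you on the no-spill path:

Code: Select all

# Generic external sort: if the input fits within the memory budget there is
# no disk I/O at all; otherwise sorted runs are spilled to disk and merged.
import heapq, os, pickle, tempfile

def external_sort(rows, key, max_rows_in_memory):
    runs, buf = [], []
    for row in rows:
        buf.append(row)
        if len(buf) >= max_rows_in_memory:             # budget exceeded:
            runs.append(_spill(sorted(buf, key=key)))  # spill a sorted run
            buf = []
    if not runs:
        return iter(sorted(buf, key=key))              # entirely in memory
    if buf:
        runs.append(_spill(sorted(buf, key=key)))
    return heapq.merge(*map(_read_run, runs), key=key) # k-way merge from disk

def _spill(sorted_rows):
    run = tempfile.NamedTemporaryFile(delete=False)
    pickle.dump(sorted_rows, run)
    run.close()
    return run.name

def _read_run(path):
    with open(path, "rb") as f:
        yield from pickle.load(f)
    os.remove(path)
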
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

ray.wurlod wrote:Not quite. It's driven not by the data, but by the design.
Actually, that's what I meant as well, and what Mike already mentioned. Apologies for the confusing wording. What I actually wanted to write was "... however it checks whether the incoming data is sorted based on the upstream tsort operators ..."
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I agree with Ray--it's better to "show" the sort by inserting the Sort stage(s) yourself in the job design, and it gives you more control over the sort settings. Win win!
Choose a job you love, and you will never have to work a day in your life. - Confucius
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

wblack wrote:I have an 8-node (4 local, 4 remote) configuration. When my job runs, its worst performance is with 8 nodes, and as I back the nodes down to 7, 6, 5 the performance improves. When I run 4 local nodes I get the best performance.
In the environment you describe, the slowest partition determines when the job is finished. The remote nodes are obviously slower due to network delays. The fact that reducing the number of nodes leads to continued performance improvement leads me to believe that you have a case where some kind of sorted repartitioning (or collection) is necessary behind your Join: the higher the number of nodes, the higher the number of rows from remote nodes that have to be sorted.
If there were no repartitioning/resorting, the job should still be slower when using any of the remote nodes, but increasing the number of nodes would reduce the number of rows per node, so the job should be faster in total. Remember that repartitioning, sort-merge and the like are costly operations in a parallel environment.
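
A toy model of that effect (illustrative numbers only; the relative per-row costs are assumptions, not measurements): the job finishes when its slowest partition finishes, so adding slower remote partitions can lengthen the run even though each node receives fewer rows.

Code: Select all

# Toy model, not a measurement: per-row costs below are assumed values.
def elapsed(total_rows, per_row_cost_by_node):
    rows_per_node = total_rows / len(per_row_cost_by_node)
    return max(rows_per_node * cost for cost in per_row_cost_by_node)

LOCAL, REMOTE = 1.0, 4.0  # assume a remote row costs 4x (network + re-sort)
print(elapsed(1_000_000, [LOCAL] * 4))                 # 4 local nodes -> 250000.0
print(elapsed(1_000_000, [LOCAL] * 4 + [REMOTE] * 4))  # 8 mixed nodes -> 500000.0
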
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Options (pick at least two):

A. Don't use any remote nodes for fastest performance as is...

B. Adding more local nodes should make it even faster yet... At some point you will hit diminishing returns and then worse performance.

C. Tune the network and/or the remote nodes.

D. Tune the job design to avoid re-partitioning the data.
Choose a job you love, and you will never have to work a day in your life. - Confucius