Sort Stage, Stage Collector sort
I believe the first question has been answered in this thread already. I would add to it by saying that the sort would be added if sort insertion has not been disabled (APT_NO_SORT_INSERTION as documented).
If you mean better as in correct results, then yes. If your data is not sorted properly (and partitioned correctly if running in parallel), you will not receive the correct join results.
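To illustrate why unsorted input gives wrong join results, here is a minimal Python sketch of a generic sort-merge inner join. It is not DataStage's implementation; `merge_join` and the sample rows are hypothetical and only show the principle that the algorithm assumes both inputs arrive in key order.

```python
def merge_join(left, right):
    """Inner join two lists of (key, value) pairs, assuming both are sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key is behind, advance left
        elif lk > rk:
            j += 1          # right key is behind, advance right
        else:
            # Emit every right row carrying the same key for this left row.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


# Hypothetical sample data.
left_sorted = [(1, "a"), (2, "b"), (3, "c")]
left_unsorted = [(3, "c"), (1, "a"), (2, "b")]
right_sorted = [(1, "x"), (2, "y"), (3, "z")]

sorted_result = merge_join(left_sorted, right_sorted)      # all three keys match
unsorted_result = merge_join(left_unsorted, right_sorted)  # only key 3 is found
```

With the unsorted left input the join skips past keys 1 and 2 on the right while looking for 3, so those rows are silently dropped rather than raising an error.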
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
I prefer an explicit Sort stage, partly because of the law of least astonishment (don't try to astonish the next programmer) and partly because it gives me control of more things, such as memory used for sorting and generation of key change flags.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
wblack wrote: I have a rather elementary question. If a join expects sorted data and the data isn't sorted it adds a tsort. Is this correct? Also, if a sort stage is added before a join where the data isn't sorted will this make a join perform better?
How would DataStage know if the data is sorted?
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs wrote: How would DataStage know if the data is sorted?
By looking at the upstream tsort operators arising from inserted sort stages, link sorts, and/or those previously inserted by DataStage.
That is also how DataStage can determine that a user-specified sort does not meet requirements (another error message that may be seen).
Other than that, DataStage cannot detect data that is already sorted (e.g. SQL ORDER BY in a connector, or a dataset sorted in a different job). That is why it is sometimes desirable to insert a Sort stage and set it to "don't sort, previously sorted".
Generate and review the job score in detail for every job under development.
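As a rough mental model of the above (not DataStage's actual logic), the decision can be sketched in Python. The function `needs_inserted_sort`, the column names, and the rule that a declared order satisfies the join when it starts with the join keys are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def needs_inserted_sort(declared_keys: Optional[Sequence[str]],
                        required_keys: Sequence[str]) -> bool:
    """Return True when a (hypothetical) score composer would add a tsort.

    declared_keys: sort order asserted by the design upstream of the join
                   (Sort stage, link sort, or an earlier inserted sort);
                   None when the design asserts nothing.
    required_keys: keys the join needs its input sorted on.
    """
    if declared_keys is None:
        return True  # nothing in the design says the link is sorted
    # Assumption: the declared order must begin with the required keys.
    return list(declared_keys[:len(required_keys)]) != list(required_keys)


# SQL ORDER BY in a connector declares nothing to the design: sort is added.
order_by_only = needs_inserted_sort(None, ["cust_id"])
# A Sort stage set to "don't sort, previously sorted" declares the order.
declared = needs_inserted_sort(["cust_id", "order_date"], ["cust_id"])
# A user-specified sort on the wrong key does not meet the requirement.
wrong_key = needs_inserted_sort(["order_date"], ["cust_id"])
```

The point of the sketch is that the data itself is never inspected; only what the design declares upstream matters.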
Mike
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
That's correct!
DataStage will insert a sort anyway, so you will see the tsort operator in the job score; however, it checks whether the incoming data is sorted and sorts it if not.
APT_NO_SORT_INSERTION and APT_SORT_INSERTION_CHECK_ONLY control this behaviour.
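A small Python sketch of the three behaviours those two settings select between may help. This is a conceptual model only: `apply_inserted_sort` and its `mode` values are hypothetical names, and raising an error when a check-only pass finds out-of-order rows is an assumption about how a failed check would surface.

```python
def apply_inserted_sort(rows, key, mode="sort"):
    """Model an automatically inserted sort under three hypothetical modes.

    mode="sort":       default, the inserted sort really sorts the rows.
    mode="check_only": the inserted sort only verifies the order.
    mode="none":       sort insertion is disabled, rows pass through untouched.
    """
    rows = list(rows)
    if mode == "none":
        return rows
    if mode == "check_only":
        # Compare each row with its successor; no reordering is done.
        for prev, cur in zip(rows, rows[1:]):
            if key(prev) > key(cur):
                # Assumption: an order violation is reported as a failure.
                raise ValueError("input is not sorted on the required key")
        return rows
    return sorted(rows, key=key)


sorted_rows = apply_inserted_sort([3, 1, 2], key=lambda r: r)
passthrough = apply_inserted_sort([3, 1, 2], key=lambda r: r, mode="none")
```

In the pass-through case nothing protects a downstream join from unsorted input, which is why disabling insertion puts the responsibility on the job design.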
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Not quite. It's driven not by the data, but by the design. If there's nothing on the input link (a Sort stage or a link sort) indicating that the data are sorted, then a tsort operator will be added when the score is composed.
All three of these methods end up using a tsort operator. Therefore "performance" is not an issue. However, the Sort stage gives you more options, particularly "don't sort (already sorted)" which can boost perceived performance and the ability to allocate more memory than the default so that the sort is more likely to be performed entirely in memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
ray.wurlod wrote: Not quite. It's driven not by the data, but by the design.
Actually, that's what I meant too, and what Mike already mentioned. Apologies for the confusing wording. What I actually wanted to write is "... however it checks if the incoming data is sorted based on the upstream tsort operators ..."
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
wblack wrote: I have an 8-node (4 local, 4 remote) configuration. When my job runs, its worst performance is with 8 nodes, and as I back the nodes down to 7, 6, 5 the performance improves. When I run 4 local it's the best performance.
In the environment you describe, the slowest partition determines when the job is finished. The remote nodes are obviously slower due to network delays. The fact that reducing the number of nodes keeps improving performance leads me to believe that you have a case where some kind of sorted repartitioning (or collection) is necessary after your Join. So the higher the number of nodes, the higher the number of rows from remote nodes that have to be sorted.
If there was no repartitioning/resorting, the job should still be slower when using any of the remote nodes, but increasing the number of nodes should reduce the number of rows per node - and so the job should be faster in total. Remember that repartitioning, sort-merge and the like are costly operations in a parallel environment.
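The no-repartitioning case can be put into numbers with a toy Python model. Everything here is hypothetical: `job_seconds`, the even split of rows across nodes, and the row counts and throughput rates are illustrative assumptions, not measurements from any real job.

```python
def job_seconds(total_rows, local_nodes, remote_nodes, local_rate, remote_rate):
    """Elapsed time when rows are split evenly and the slowest node finishes last.

    Rates are in rows per second. Repartitioning and sort-merge costs are
    deliberately left out of this model.
    """
    nodes = local_nodes + remote_nodes
    rows_per_node = total_rows / nodes
    times = [rows_per_node / local_rate] * local_nodes
    times += [rows_per_node / remote_rate] * remote_nodes
    return max(times)  # the job ends when the slowest partition ends


# Hypothetical figures: 8 million rows, remote nodes four times slower.
four_local = job_seconds(8_000_000, 4, 0, 100_000, 25_000)
five_nodes = job_seconds(8_000_000, 4, 1, 100_000, 25_000)
eight_nodes = job_seconds(8_000_000, 4, 4, 100_000, 25_000)
```

Under these assumptions any run with a remote node is slower than four local nodes, but eight nodes still beat five because each node handles fewer rows. A job that instead gets steadily slower as remote nodes are added is behaving differently from this model, which is what points to repartitioning or re-sorting cost.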
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
Options (pick at least two):
A. Don't use any remote nodes for fastest performance as is...
B. Adding more local nodes should make it even faster yet... At some point you will find diminishing returns then worse performance.
C. Tune the network and/or the remote nodes.
D. Tune the job design to avoid re-partitioning the data.
Choose a job you love, and you will never have to work a day in your life. - Confucius