Sort Stage, Stage Collector sort

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

I believe the first question has been answered in this thread already. I would add that the sort will only be inserted if sort insertion has not been disabled (via APT_NO_SORT_INSERTION, as documented).

If you mean better as in correct results, then yes. If your data is not sorted properly (and partitioned correctly if running in parallel), you will not receive the correct join results.
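
To see why, here is a conceptual sketch in ordinary Python (not DataStage code; the row layout and names are made up). A sort-merge style join only finds matches when rows with the same key land in the same partition and arrive in key order:

Code: Select all

# Conceptual sketch only -- not how the Join stage is implemented.
def hash_partition(rows, key, nparts):
    # Matching keys from both inputs must land in the same partition;
    # hashing on the join key guarantees that.
    parts = [[] for _ in range(nparts)]
    for row in rows:
        parts[hash(row[key]) % nparts].append(row)
    return parts

def merge_join(left, right, key):
    # Classic merge join: assumes both inputs are sorted on `key`
    # (and, for brevity, that keys are unique within each input).
    # Feed it unsorted data and matches are silently skipped --
    # wrong results rather than an error message.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out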

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
wblack
Premium Member
Posts: 30
Joined: Thu Sep 23, 2010 7:55 am

Post by wblack »

Is allowing the join to insert a tsort any better or worse than adding a Sort stage yourself and then setting the partitioning to Same in the Join stage?
William Black
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I prefer an explicit Sort stage, partly because of the law of least astonishment (don't try to astonish the next programmer) and partly because it gives me control of more things, such as memory used for sorting and generation of key change flags.
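
As an illustration of the key change flag (a conceptual sketch in plain Python, not DataStage internals; the column name keyChange is only illustrative): it is simply a 1 on the first row of each key group in sorted data and a 0 on every other row.

Code: Select all

# Conceptual sketch; column and key names are illustrative.
def add_key_change(sorted_rows, key):
    # Emit 1 on the first row of each key group, 0 on the rest.
    prev = object()  # sentinel that never equals a real key value
    for row in sorted_rows:
        row["keyChange"] = 1 if row[key] != prev else 0
        prev = row[key]
        yield row
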
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

wblack wrote:I have a rather elementary question. If a join expects sorted data and the data isn't sorted, it adds a tsort. Is this correct? Also, if a Sort stage is added before a join where the data isn't sorted, will this make the join perform better?
How would DataStage know if the data is sorted?
Phil Hibbs | Capgemini
Technical Consultant
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

PhilHibbs wrote:How would DataStage know if the data is sorted?
By looking at the upstream tsort operators arising from Sort stages, link sorts, and/or sorts previously inserted by DataStage itself.

That is also how DataStage can determine that a user-specified sort does not meet the requirements (another message you may see).

Beyond that, DataStage cannot detect data that is already sorted (e.g. by a SQL ORDER BY in a connector, or a dataset sorted in a different job). That is why it is sometimes desirable to insert a Sort stage and set it to "don't sort, previously sorted".
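
Conceptually, "don't sort, previously sorted" amounts to asserting the order you have declared instead of paying for a full re-sort, roughly like this plain-Python sketch (illustrative only, not the tsort implementation):

Code: Select all

# Conceptual sketch: verify a declared sort order rather than re-sort.
def assert_sorted(rows, key):
    prev = None
    for n, row in enumerate(rows):
        if prev is not None and row[key] < prev:
            raise ValueError("input not sorted on %r at row %d" % (key, n))
        prev = row[key]
        yield row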

Generate and review the job score in detail for every job under development.

Mike
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

That's correct!

DataStage will insert the sort anyway, and hence you will see the tsort operator in the job score; however, it checks whether the incoming data is already sorted and only sorts it if necessary.

APT_NO_SORT_INSERTION and APT_INSERT_SORT_CHECKONLY control this behaviour.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not quite. It's driven not by the data, but by the design. If there's nothing on the input link (a Sort stage or a link sort) indicating that the data are sorted, then a tsort operator will be added when the score is composed.

All three of these methods end up using a tsort operator, therefore "performance" is not an issue. However, the Sort stage gives you more options: in particular "don't sort (already sorted)", which can boost perceived performance, and the ability to allocate more memory than the default so that the sort is more likely to be performed entirely in memory.
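
The memory point is the usual in-memory versus spill-to-disk trade-off of any external sort. A generic sketch in plain Python (illustrative only, not the tsort operator; the memory limit is counted in rows for simplicity) shows why a bigger limit keeps you on the no-spill path:

Code: Select all

# Generic external sort: if the input fits within the memory budget there is
# no disk I/O at all; otherwise sorted runs are spilled to disk and merged.
import heapq, os, pickle, tempfile

def external_sort(rows, key, max_rows_in_memory):
    runs, buf = [], []
    for row in rows:
        buf.append(row)
        if len(buf) >= max_rows_in_memory:             # budget exceeded:
            runs.append(_spill(sorted(buf, key=key)))  # spill a sorted run
            buf = []
    if not runs:
        return iter(sorted(buf, key=key))              # entirely in memory
    if buf:
        runs.append(_spill(sorted(buf, key=key)))
    return heapq.merge(*map(_read_run, runs), key=key) # k-way merge from disk

def _spill(sorted_rows):
    run = tempfile.NamedTemporaryFile(delete=False)
    pickle.dump(sorted_rows, run)
    run.close()
    return run.name

def _read_run(path):
    with open(path, "rb") as f:
        yield from pickle.load(f)
    os.remove(path)
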
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

ray.wurlod wrote:Not quite. It's driven not by the data, but by the design.
Actually, that's what I meant as well, and what Mike already mentioned. Apologies for the confusing wording. What I actually wanted to write was "... however it checks whether the incoming data is sorted based on the upstream tsort operators ..."
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I agree with Ray--it's better to "show" the sort by inserting the Sort stage(s) yourself in the job design, and it gives you more control over the sort settings. Win win!
Choose a job you love, and you will never have to work a day in your life. - Confucius
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

wblack wrote:I have an 8-node (4 local, 4 remote) configuration. When my job runs, its worst performance is with 8 nodes, and as I back the nodes down to 7, 6, 5 the performance improves. When I run 4 local nodes I get the best performance.
In the environment you describe, the slowest partition determines when the job is finished. The remote nodes are obviously slower due to network delays. The fact that reducing the number of nodes leads to continued performance improvement leads me to believe that you have a case where some kind of sorted repartitioning (or collection) is necessary behind your Join: the higher the number of nodes, the higher the number of rows from remote nodes that have to be sorted.
If there were no repartitioning/resorting, the job should still be slower when using any of the remote nodes, but increasing the number of nodes would reduce the number of rows per node, so the job should be faster in total. Remember that repartitioning, sort-merge and the like are costly operations in a parallel environment.
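
A toy model of that effect (illustrative numbers only; the relative per-row costs are assumptions, not measurements): the job finishes when its slowest partition finishes, so adding slower remote partitions can lengthen the run even though each node receives fewer rows.

Code: Select all

# Toy model, not a measurement: per-row costs below are assumed values.
def elapsed(total_rows, per_row_cost_by_node):
    rows_per_node = total_rows / len(per_row_cost_by_node)
    return max(rows_per_node * cost for cost in per_row_cost_by_node)

LOCAL, REMOTE = 1.0, 4.0  # assume a remote row costs 4x (network + re-sort)
print(elapsed(1_000_000, [LOCAL] * 4))                 # 4 local nodes -> 250000.0
print(elapsed(1_000_000, [LOCAL] * 4 + [REMOTE] * 4))  # 8 mixed nodes -> 500000.0
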
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Options (pick at least two):

A. Don't use any remote nodes for fastest performance as is...

B. Adding more local nodes should make it even faster yet... At some point you will hit diminishing returns and then worse performance.

C. Tune the network and/or the remote nodes.

D. Tune the job design to avoid re-partitioning the data.
Choose a job you love, and you will never have to work a day in your life. - Confucius