Order By Clause or Explicit Sort Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Order By Clause or Explicit Sort Stage

Post by abhilashnair »

I am joining two DB2 tables. Is it necessary to put an explicit Sort Stage before join or can I use the ORDER BY clause in the DB2 query itself and then directly link the DB2 stages with the Join stage without using Sort?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you don't put in a Sort stage, DataStage will insert tsort operators, because the Join stage requires sorted inputs. If you have an ORDER BY clause, you can make the Sort stage specify "don't sort (already sorted)" so that it does nothing, but the existence of the Sort stage will prevent DataStage from inserting unnecessary tsort operators.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

Join Stage basically needs the input to be sorted. This is achieved by ORDER BY clause in DB2 stage itself.

Even after specifying ORDER BY in DB2, is there a necessity to include an explicit sort stage, and then again specify "don't sort " in that stage. Is there a point in doing this because it is equivalent to not including Sort Stage.

One can reduce the overhead of including additional stages when the same functionality is provided by the previous stage (ORDER BY clause in this case).
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

One can reduce the overhead of including additional stages when the same functionality is provided by the previous stage (ORDER BY clause in this case).[/quote]

Here overhead of one sort stage is not a problem. The tsort operators from Datastage will overhead to your job. The tsort operators will degrade your performance. To avoid tsort opertors use sort satge with the option Don't sort if previously sorted and add environment variable APT_SORT_INSERTION_CHECK_ONLY. It will stop the tsort opertors, if data is already sorted.

Addition of one stage and perofrmance, you have to decide.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

Nagesh, do you mean to say that, if I include explicit sort stage, then DataStage won't use tsort operators?

It does not make sense to me to use sort stage and then again tell the stage "Dont' sort".
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It may not make sense to you, but it's the technique we use to prevent insertion by the framework of unnecessary tsort operators. If you don't like it, then don't complain about the performance hit taken because of the inserted tsort operators resorting already-sorted data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

abhilashnair wrote:Nagesh, do you mean to say that, if I include explicit sort stage, then DataStage won't use tsort operators?

It does not make sense to me to use sort stage and then again tell the stage "Dont' sort".
Hi Abhilash,

You can do one thing to avoid one extra sort stage, perform HASH partition on Join stage based on keys as you have already performed on Database and use environmental variable APT_SORT_INSERTION_CHECK_ONLY and try out for result. If your results are proper at the output of Join stage then that's ok for you.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

I am not clear on this. I have few questions:

1) If I use an ORDER BY clause in DB2 stage, the output data is sorted. This sorted data is passed to join stage. Do you mean to say that in the join stage, tsort operators will again try to sort the already sorted data of the DB2 stage?

2) Suppose I use explicit sort stage in between. In that case, wont the framework again insert tsort in join stage?

3) What exactly is the use of sort stage? What is that APT_SORT_INSERTION_CHECK_ONLY operator used for?

4) If I use explicit sort, is there a need to include ORDER BY in DB2
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India
Contact:

Post by thebird »

1) If I use an ORDER BY clause in DB2 stage, the output data is sorted. This sorted data is passed to join stage. Do you mean to say that in the join stage, tsort operators will again try to sort the already sorted data of the DB2 stage?
Yes. DataStage tries to optimise the code and puts in a tsort operator before the Join, eventhough the data has alredy been sorted in the DB/DB2 stage.
2) Suppose I use explicit sort stage in between. In that case, wont the framework again insert tsort in join stage?
If you put in a Sort stage before the join and mention the property as "Don't Sort", DS will not put in the tsort operator. Else the tsort operator is again put in.
3) What exactly is the use of sort stage? What is that APT_SORT_INSERTION_CHECK_ONLY operator used for?


This Environment Variable checks if the data is sorted or not and only if the data is not sorted inserts the tsort operator into the code for optimisation.
4) If I use explicit sort, is there a need to include ORDER BY in DB2.
If you are using an Order By in the DB2 stage - then put the sort Stage and set the property inside as "Don't Sort".
If you are not doing an Order By in the DB2 stage, then put in the Sort stage, set the option to "Sort" and sort the data before feeding it to the Join stage.

Whatever be the sort option - the partitioning should always be Hash for the data going into the Join stage.

Aneesh
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

Thanks Aneesh!!!!! For the replies you gave to each of my questions, that solves all my doubts connected to sort stage.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Sort stage will itself insert a sort operator, and by giving Don't sort, you are suppresing it. Rather you can include the environmental variable APT_NO_SORT_INSERTION in that job if you have used order by in both the source stages.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
abhilashnair
Participant
Posts: 284
Joined: Fri Oct 13, 2006 4:31 am

Post by abhilashnair »

What if the output from DB2 is grouped but not sorted? What should be the partition method to bbe used in the Sort stage? Currently it is Auto.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Group by dosn't sort the data and Auto partition doesn't assure you the sorted output. The preferred option is to hold hash partition on the sorting key which should be same or should atleast cover up all the joining key.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

abhilashnair wrote:What if the output from DB2 is grouped but not sorted? What should be the partition method to bbe used in the Sort stage? Currently it is Auto.
"Don't sort (previously grouped)"
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Ray's option is for previously grouped and sorted.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Post Reply