Page 1 of 2

Order By Clause or Explicit Sort Stage

Posted: Wed Dec 06, 2006 2:03 am
by abhilashnair
I am joining two DB2 tables. Is it necessary to put an explicit Sort Stage before join or can I use the ORDER BY clause in the DB2 query itself and then directly link the DB2 stages with the Join stage without using Sort?

Posted: Wed Dec 06, 2006 2:42 am
by ray.wurlod
If you don't put in a Sort stage, DataStage will insert tsort operators, because the Join stage requires sorted inputs. If you have an ORDER BY clause, you can make the Sort stage specify "don't sort (already sorted)" so that it does nothing, but the existence of the Sort stage will prevent DataStage from inserting unnecessary tsort operators.

Posted: Wed Dec 06, 2006 3:02 am
by abhilashnair
Join Stage basically needs the input to be sorted. This is achieved by ORDER BY clause in DB2 stage itself.

Even after specifying ORDER BY in DB2, is there a necessity to include an explicit sort stage, and then again specify "don't sort " in that stage. Is there a point in doing this because it is equivalent to not including Sort Stage.

One can reduce the overhead of including additional stages when the same functionality is provided by the previous stage (ORDER BY clause in this case).

Posted: Wed Dec 06, 2006 4:04 am
by Nageshsunkoji
One can reduce the overhead of including additional stages when the same functionality is provided by the previous stage (ORDER BY clause in this case).[/quote]

Here overhead of one sort stage is not a problem. The tsort operators from Datastage will overhead to your job. The tsort operators will degrade your performance. To avoid tsort opertors use sort satge with the option Don't sort if previously sorted and add environment variable APT_SORT_INSERTION_CHECK_ONLY. It will stop the tsort opertors, if data is already sorted.

Addition of one stage and perofrmance, you have to decide.

Posted: Wed Dec 06, 2006 5:04 am
by abhilashnair
Nagesh, do you mean to say that, if I include explicit sort stage, then DataStage won't use tsort operators?

It does not make sense to me to use sort stage and then again tell the stage "Dont' sort".

Posted: Wed Dec 06, 2006 5:26 am
by ray.wurlod
It may not make sense to you, but it's the technique we use to prevent insertion by the framework of unnecessary tsort operators. If you don't like it, then don't complain about the performance hit taken because of the inserted tsort operators resorting already-sorted data.

Posted: Wed Dec 06, 2006 5:27 am
by Nageshsunkoji
abhilashnair wrote:Nagesh, do you mean to say that, if I include explicit sort stage, then DataStage won't use tsort operators?

It does not make sense to me to use sort stage and then again tell the stage "Dont' sort".
Hi Abhilash,

You can do one thing to avoid one extra sort stage, perform HASH partition on Join stage based on keys as you have already performed on Database and use environmental variable APT_SORT_INSERTION_CHECK_ONLY and try out for result. If your results are proper at the output of Join stage then that's ok for you.

Posted: Wed Dec 06, 2006 11:29 pm
by abhilashnair
I am not clear on this. I have few questions:

1) If I use an ORDER BY clause in DB2 stage, the output data is sorted. This sorted data is passed to join stage. Do you mean to say that in the join stage, tsort operators will again try to sort the already sorted data of the DB2 stage?

2) Suppose I use explicit sort stage in between. In that case, wont the framework again insert tsort in join stage?

3) What exactly is the use of sort stage? What is that APT_SORT_INSERTION_CHECK_ONLY operator used for?

4) If I use explicit sort, is there a need to include ORDER BY in DB2

Posted: Thu Dec 07, 2006 12:14 am
by thebird
1) If I use an ORDER BY clause in DB2 stage, the output data is sorted. This sorted data is passed to join stage. Do you mean to say that in the join stage, tsort operators will again try to sort the already sorted data of the DB2 stage?
Yes. DataStage tries to optimise the code and puts in a tsort operator before the Join, eventhough the data has alredy been sorted in the DB/DB2 stage.
2) Suppose I use explicit sort stage in between. In that case, wont the framework again insert tsort in join stage?
If you put in a Sort stage before the join and mention the property as "Don't Sort", DS will not put in the tsort operator. Else the tsort operator is again put in.
3) What exactly is the use of sort stage? What is that APT_SORT_INSERTION_CHECK_ONLY operator used for?


This Environment Variable checks if the data is sorted or not and only if the data is not sorted inserts the tsort operator into the code for optimisation.
4) If I use explicit sort, is there a need to include ORDER BY in DB2.
If you are using an Order By in the DB2 stage - then put the sort Stage and set the property inside as "Don't Sort".
If you are not doing an Order By in the DB2 stage, then put in the Sort stage, set the option to "Sort" and sort the data before feeding it to the Join stage.

Whatever be the sort option - the partitioning should always be Hash for the data going into the Join stage.

Aneesh

Posted: Thu Dec 07, 2006 12:41 am
by abhilashnair
Thanks Aneesh!!!!! For the replies you gave to each of my questions, that solves all my doubts connected to sort stage.

Posted: Thu Dec 07, 2006 2:13 am
by kumar_s
Sort stage will itself insert a sort operator, and by giving Don't sort, you are suppresing it. Rather you can include the environmental variable APT_NO_SORT_INSERTION in that job if you have used order by in both the source stages.

Posted: Thu Dec 07, 2006 2:52 am
by abhilashnair
What if the output from DB2 is grouped but not sorted? What should be the partition method to bbe used in the Sort stage? Currently it is Auto.

Posted: Thu Dec 07, 2006 2:58 am
by kumar_s
Group by dosn't sort the data and Auto partition doesn't assure you the sorted output. The preferred option is to hold hash partition on the sorting key which should be same or should atleast cover up all the joining key.

Posted: Thu Dec 07, 2006 3:02 am
by ray.wurlod
abhilashnair wrote:What if the output from DB2 is grouped but not sorted? What should be the partition method to bbe used in the Sort stage? Currently it is Auto.
"Don't sort (previously grouped)"

Posted: Thu Dec 07, 2006 3:08 am
by kumar_s
Ray's option is for previously grouped and sorted.