Page 1 of 1

Sorting in Database Vs Datastage

Posted: Wed Apr 28, 2010 1:23 pm
by sshettar
Hi All,

I need to have a small clarification,
i have this job where in there are several join stages ( about 7 join stages)
the first join stage has like 7 input links while the remianing join stages just have 2 input links.

We are running all the jobs on 4 nodes.
Since there are soo many join stages , i wanted to optimize the job .
I wanted to sort the data as mch as possible in the database stages.
but when i do that will i be able to just give the partition method and not opt for sorting the data for that link in the join stage.
I understand that the database could be on diferent nodes config , but will the sorting really matter here?
As in
1a) should i give mention in the join stage for the link where the data is coming from the database stage as sorted as hash
partition on the key fields and select the sort option and check the stable sort in there?
1b) or should i not select the sort option at all and set the APT_NO_SORT_INSERTION to true?


2) is it best to sort the data in the join stage itself?

Thanks in advance

Posted: Wed Apr 28, 2010 1:46 pm
by chulett
If you do the sorting in the database and you want to ensure it doesn't get sorted again after that, add an explicit Sort stage and set the option there to "Don't Sort, Already Sorted". Or, from what I understand, you could do "1b" instead but the former technique will be a little more... visible.

Posted: Wed Apr 28, 2010 1:51 pm
by sshettar
Thanks Craig..
Just one more clarification,
the sort i do on the database will still be good in the process irrespective of the database being on 8 node and the datastage on 4 node?

Thanks in advance

Posted: Wed Apr 28, 2010 2:05 pm
by chulett
Depends on what you mean by 'node' from a database standpoint. Just make sure your sorting is appropriate for the partitioning you are doing in the job and you should be fine regardless of the 'nodes/partitions' of your database.

Posted: Wed Apr 28, 2010 5:13 pm
by ray.wurlod
Which machine has the most free resources?

In a future version (possibly the next version, which some IBMers are now calling 8.5) something called "dynamic optimization" is introduced, where you design a sort (and some other operations) and a dynamic decision is made by the software about pushing some or all of that processing into the database server; it's a cost-based optimization, based on what was presented at IOD 2009.

Posted: Fri May 07, 2010 9:30 pm
by zulfi123786
There was a small typing mistake in my earlier post, sorry for that :

I have a little confusion here as

The DB2 database stage runs sequentially and if i connect it to a join stage which runs in parallel the DataStage will partition the data across the nodes and when this happens i guess there is a chance for the sort order of the records to gets messed up resulting in a un-sorted data within each node (partition) and this is not what we would like to have in a join stage which requires sorted inputs so i guess it's always better to hash and sort in the join stage input link itself............

Gurus please share your thoughts over the above

Thanks

Posted: Sat May 08, 2010 1:52 am
by ray.wurlod
You must take control of what happens. Partitioning on the sort key will not destroy sorted order (think about it).

Posted: Sun May 09, 2010 11:44 pm
by zulfi123786
What i meant was if we have a stage that runs sequentially the data will be available on the conductor node (partition) [only if the stage runs parallely the data will be partitioned across available nodes else only the conductor partition will hold the data, this is what is presume] and suppose that the source data (sequential) which is sorted is now partitioned (hash partitioned) acorss 4 nodes then the data within each node i guess will no longer be in sorted order unless i set explicitly the inline sort.

If my above understanding is correct then i feel it's better we sort the data on the i/p link of the join rather than putting an order by in the database query.

Posted: Mon May 10, 2010 12:22 am
by ray.wurlod
As I said, think about it. Sketch it out with pencil and paper. Make sure you are partitioning on the sort key. This means that all rows with the same value of the sort key will be on the same partition as each other; all rows with a different value of the sort key will be on the same partition as each other, and so on. If the data happen to be pre-sorted, this will not be disrupted.

Posted: Mon May 10, 2010 12:42 am
by zulfi123786
Many Thanks Ray for you clarification
I was with the impression that when partitioning algorithm runs over sorted data it may disrupt the sort order of already sorted data which is now clear that i was worng.
:)

Posted: Mon May 10, 2010 4:00 pm
by ray.wurlod
You're not necessarily wrong, or even worng. :lol:

If partitioning is on something other than the sort key, then chances are that it will disrupt sorting - though existing sort order per partition is likely to be preserved.

Again, sketch it out on paper or a whiteboard to understand what occurs.

Posted: Tue May 11, 2010 1:04 am
by zulfi123786
Throught the discussion i was talking about the partitioning over the sort key........ :)