Sorting in Database Vs Datastage

sshettar · Post by **sshettar** » Wed Apr 28, 2010 1:23 pm

Hi All,

I need to have a small clarification,
i have this job where in there are several join stages ( about 7 join stages)
the first join stage has like 7 input links while the remianing join stages just have 2 input links.

We are running all the jobs on 4 nodes.
Since there are soo many join stages , i wanted to optimize the job .
I wanted to sort the data as mch as possible in the database stages.
but when i do that will i be able to just give the partition method and not opt for sorting the data for that link in the join stage.
I understand that the database could be on diferent nodes config , but will the sorting really matter here?
As in
1a) should i give mention in the join stage for the link where the data is coming from the database stage as sorted as hash
partition on the key fields and select the sort option and check the stable sort in there?
1b) or should i not select the sort option at all and set the APT_NO_SORT_INSERTION to true?

2) is it best to sort the data in the join stage itself?

Thanks in advance

chulett · Post by **chulett** » Wed Apr 28, 2010 1:46 pm

If you do the sorting in the database and you want to ensure it doesn't get sorted again after that, add an explicit Sort stage and set the option there to "Don't Sort, Already Sorted". Or, from what I understand, you could do "1b" instead but the former technique will be a little more... visible.

sshettar · Post by **sshettar** » Wed Apr 28, 2010 1:51 pm

Thanks Craig..
Just one more clarification,
the sort i do on the database will still be good in the process irrespective of the database being on 8 node and the datastage on 4 node?

Thanks in advance

chulett · Post by **chulett** » Wed Apr 28, 2010 2:05 pm

Depends on what you mean by 'node' from a database standpoint. Just make sure your sorting is appropriate for the partitioning you are doing in the job and you should be fine regardless of the 'nodes/partitions' of your database.

ray.wurlod · Post by **ray.wurlod** » Wed Apr 28, 2010 5:13 pm

Which machine has the most free resources?

In a future version (possibly the next version, which some IBMers are now calling 8.5) something called "dynamic optimization" is introduced, where you design a sort (and some other operations) and a dynamic decision is made by the software about pushing some or all of that processing into the database server; it's a cost-based optimization, based on what was presented at IOD 2009.

zulfi123786 · Post by **zulfi123786** » Fri May 07, 2010 9:30 pm

There was a small typing mistake in my earlier post, sorry for that :

I have a little confusion here as

The DB2 database stage runs sequentially and if i connect it to a join stage which runs in parallel the DataStage will partition the data across the nodes and when this happens i guess there is a chance for the sort order of the records to gets messed up resulting in a un-sorted data within each node (partition) and this is not what we would like to have in a join stage which requires sorted inputs so i guess it's always better to hash and sort in the join stage input link itself............

Gurus please share your thoughts over the above

Thanks

ray.wurlod · Post by **ray.wurlod** » Sat May 08, 2010 1:52 am

You must take control of what happens. Partitioning on the sort key will not destroy sorted order (think about it).

zulfi123786 · Post by **zulfi123786** » Sun May 09, 2010 11:44 pm

What i meant was if we have a stage that runs sequentially the data will be available on the conductor node (partition) [only if the stage runs parallely the data will be partitioned across available nodes else only the conductor partition will hold the data, this is what is presume] and suppose that the source data (sequential) which is sorted is now partitioned (hash partitioned) acorss 4 nodes then the data within each node i guess will no longer be in sorted order unless i set explicitly the inline sort.

If my above understanding is correct then i feel it's better we sort the data on the i/p link of the join rather than putting an order by in the database query.

ray.wurlod · Post by **ray.wurlod** » Mon May 10, 2010 12:22 am

As I said, think about it. Sketch it out with pencil and paper. Make sure you are partitioning on the sort key. This means that all rows with the same value of the sort key will be on the same partition as each other; all rows with a different value of the sort key will be on the same partition as each other, and so on. If the data happen to be pre-sorted, this will not be disrupted.

zulfi123786 · Post by **zulfi123786** » Mon May 10, 2010 12:42 am

Many Thanks Ray for you clarification
I was with the impression that when partitioning algorithm runs over sorted data it may disrupt the sort order of already sorted data which is now clear that i was worng.

ray.wurlod · Post by **ray.wurlod** » Mon May 10, 2010 4:00 pm

You're not necessarily wrong, or even worng.

If partitioning is on something other than the sort key, then chances are that it will disrupt sorting - though existing sort order per partition is likely to be preserved.

Again, sketch it out on paper or a whiteboard to understand what occurs.

zulfi123786 · Post by **zulfi123786** » Tue May 11, 2010 1:04 am

Throught the discussion i was talking about the partitioning over the sort key........