Sorting in Database Vs Datastage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
sshettar
Premium Member
Premium Member
Posts: 264
Joined: Thu Nov 30, 2006 10:37 am

Sorting in Database Vs Datastage

Post by sshettar »

Hi All,

I need to have a small clarification,
i have this job where in there are several join stages ( about 7 join stages)
the first join stage has like 7 input links while the remianing join stages just have 2 input links.

We are running all the jobs on 4 nodes.
Since there are soo many join stages , i wanted to optimize the job .
I wanted to sort the data as mch as possible in the database stages.
but when i do that will i be able to just give the partition method and not opt for sorting the data for that link in the join stage.
I understand that the database could be on diferent nodes config , but will the sorting really matter here?
As in
1a) should i give mention in the join stage for the link where the data is coming from the database stage as sorted as hash
partition on the key fields and select the sort option and check the stable sort in there?
1b) or should i not select the sort option at all and set the APT_NO_SORT_INSERTION to true?


2) is it best to sort the data in the join stage itself?

Thanks in advance
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you do the sorting in the database and you want to ensure it doesn't get sorted again after that, add an explicit Sort stage and set the option there to "Don't Sort, Already Sorted". Or, from what I understand, you could do "1b" instead but the former technique will be a little more... visible.
-craig

"You can never have too many knives" -- Logan Nine Fingers
sshettar
Premium Member
Premium Member
Posts: 264
Joined: Thu Nov 30, 2006 10:37 am

Post by sshettar »

Thanks Craig..
Just one more clarification,
the sort i do on the database will still be good in the process irrespective of the database being on 8 node and the datastage on 4 node?

Thanks in advance
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Depends on what you mean by 'node' from a database standpoint. Just make sure your sorting is appropriate for the partitioning you are doing in the job and you should be fine regardless of the 'nodes/partitions' of your database.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Which machine has the most free resources?

In a future version (possibly the next version, which some IBMers are now calling 8.5) something called "dynamic optimization" is introduced, where you design a sort (and some other operations) and a dynamic decision is made by the software about pushing some or all of that processing into the database server; it's a cost-based optimization, based on what was presented at IOD 2009.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

There was a small typing mistake in my earlier post, sorry for that :

I have a little confusion here as

The DB2 database stage runs sequentially and if i connect it to a join stage which runs in parallel the DataStage will partition the data across the nodes and when this happens i guess there is a chance for the sort order of the records to gets messed up resulting in a un-sorted data within each node (partition) and this is not what we would like to have in a join stage which requires sorted inputs so i guess it's always better to hash and sort in the join stage input link itself............

Gurus please share your thoughts over the above

Thanks
Last edited by zulfi123786 on Sun May 09, 2010 11:36 pm, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You must take control of what happens. Partitioning on the sort key will not destroy sorted order (think about it).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

What i meant was if we have a stage that runs sequentially the data will be available on the conductor node (partition) [only if the stage runs parallely the data will be partitioned across available nodes else only the conductor partition will hold the data, this is what is presume] and suppose that the source data (sequential) which is sorted is now partitioned (hash partitioned) acorss 4 nodes then the data within each node i guess will no longer be in sorted order unless i set explicitly the inline sort.

If my above understanding is correct then i feel it's better we sort the data on the i/p link of the join rather than putting an order by in the database query.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

As I said, think about it. Sketch it out with pencil and paper. Make sure you are partitioning on the sort key. This means that all rows with the same value of the sort key will be on the same partition as each other; all rows with a different value of the sort key will be on the same partition as each other, and so on. If the data happen to be pre-sorted, this will not be disrupted.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

Many Thanks Ray for you clarification
I was with the impression that when partitioning algorithm runs over sorted data it may disrupt the sort order of already sorted data which is now clear that i was worng.
:)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You're not necessarily wrong, or even worng. :lol:

If partitioning is on something other than the sort key, then chances are that it will disrupt sorting - though existing sort order per partition is likely to be preserved.

Again, sketch it out on paper or a whiteboard to understand what occurs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

Throught the discussion i was talking about the partitioning over the sort key........ :)
Post Reply