Auto partitioning in join/merge stages

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
NigeGriff
Premium Member
Premium Member
Posts: 46
Joined: Mon Nov 24, 2003 5:46 am

Auto partitioning in join/merge stages

Post by NigeGriff »

I have just read the enterprise developers guide section on join and merge stages and it states that if auto partitioning is selected datastage will automatically hash partition on the matching key and then sort the partitions using the matching key.

This feature has a productivity gain associated with as it negates the need to add a sort stage on each of the inputs to the join/merge stage to partition and sort the data.

I am intending to start using this feature going forward but before i do are there any risks/issues with using this feature?
Thanks
Nigel
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

It will work fine unless someone changes the environment variables
APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION (or something like that don't remember the exact name) to true for some specific scenario (remove warnings).
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I don't see a performance gain in this scenario, in fact there is a potential for having to re-sort data. If your join left & right streams are sorted and partitioned on the main join key then resorting and repartitioning isn't necessary. In addition, if the data is sorted on one key and not on another, then you can so specify in a sort stage ("data already sorted") for that one column and gain a lot of time, whereas the implicit sort stage has no such facility and would need to do a costly sort.
NigeGriff
Premium Member
Premium Member
Posts: 46
Joined: Mon Nov 24, 2003 5:46 am

Post by NigeGriff »

ArndW,

i was specifically talking about the scenario where the inputs to the join/merge did require partitioning and sorting and the productivity gain being not having to add sort stages on all the inputs because auto would performs these tasks autmoatically 'under the covers'.
Thanks
Nigel
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I think I like the idea of using an explicit sort stage as it gives the designer more control (i.e. limiting the sort memory, choosing the type of sort) but with unsorted data that needs repartitioning there is, as you've said, no real difference.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

DataStage will insert hash partitioner and tsort operators for Join, Merge and certain other stage types, if (Auto) is selected and insertion is not prevented, and if there is no sorting specified on those input links. It doesn't matter that the data have already been sorted upstream; the stage itself has no way of knowing that. That's why I concur with the advice to use an explicit Sort stage, so that you can leverage its "don't sort, previously sorted" capability to truly gain performance advantage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sshettar
Premium Member
Premium Member
Posts: 264
Joined: Thu Nov 30, 2006 10:37 am

Post by sshettar »

Sorry if i am reopening the topic even if is it is not posted by me .
but since i had a similar doubt and i was going through this post , i'm doing so ...

Lets say i do a order by clause in my sql query to sort the columns needed to join and now pass this data through the sort stage where i partition the data based on joining columns and select dont sort previously sorted option for all the key columns.
but when i pass this data now through join stage and select same partitioning and not select any of sort option , will the join stage still insert the hash partitioner and tsort operators ?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Unless there is explicit sorting on its inputs (namely a Sort stage or input link sort), the Join stage will have tsort operators inserted. Whether or not there is sorting does not affect whether partitioners are inserted; this is usually driven by whether or not the partitioning algorithm is set to Auto. Finally, because inputs to Join stage typically arrive at different rates, buffer operators are usually also inserted in Join stage inputs to avoid deadlock situations.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sshettar
Premium Member
Premium Member
Posts: 264
Joined: Thu Nov 30, 2006 10:37 am

Post by sshettar »

Now as i have in my example have 2 sort stages before the join stages (one for each input link to join stage) my understanding is join stage will not insert any tsort operators.

I'm sorry if i'm being really difficult in understanding.


If so then having explicit sort stages before the join stage would be better in temrs of performance as we would avoid the tsort operators being inserted?

Your advice and patience is truly appreciated
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Yes, I prefer putting in an explicit sort stage before the join and explicitly sorting or stating "don't sort, already sorted" for each join key.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can inspect the score to determine whether operators are inserted.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply