Auto partitioning in join/merge stages
Moderators: chulett, rschirm, roy
Auto partitioning in join/merge stages
I have just read the enterprise developers guide section on join and merge stages and it states that if auto partitioning is selected datastage will automatically hash partition on the matching key and then sort the partitions using the matching key.
This feature has a productivity gain associated with as it negates the need to add a sort stage on each of the inputs to the join/merge stage to partition and sort the data.
I am intending to start using this feature going forward but before i do are there any risks/issues with using this feature?
This feature has a productivity gain associated with as it negates the need to add a sort stage on each of the inputs to the join/merge stage to partition and sort the data.
I am intending to start using this feature going forward but before i do are there any risks/issues with using this feature?
Thanks
Nigel
Nigel
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
It will work fine unless someone changes the environment variables
APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION (or something like that don't remember the exact name) to true for some specific scenario (remove warnings).
APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION (or something like that don't remember the exact name) to true for some specific scenario (remove warnings).
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
Genius may have its limitations, but stupidity is not thus handicapped.
I don't see a performance gain in this scenario, in fact there is a potential for having to re-sort data. If your join left & right streams are sorted and partitioned on the main join key then resorting and repartitioning isn't necessary. In addition, if the data is sorted on one key and not on another, then you can so specify in a sort stage ("data already sorted") for that one column and gain a lot of time, whereas the implicit sort stage has no such facility and would need to do a costly sort.
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
I think I like the idea of using an explicit sort stage as it gives the designer more control (i.e. limiting the sort memory, choosing the type of sort) but with unsorted data that needs repartitioning there is, as you've said, no real difference.
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
DataStage will insert hash partitioner and tsort operators for Join, Merge and certain other stage types, if (Auto) is selected and insertion is not prevented, and if there is no sorting specified on those input links. It doesn't matter that the data have already been sorted upstream; the stage itself has no way of knowing that. That's why I concur with the advice to use an explicit Sort stage, so that you can leverage its "don't sort, previously sorted" capability to truly gain performance advantage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sorry if i am reopening the topic even if is it is not posted by me .
but since i had a similar doubt and i was going through this post , i'm doing so ...
Lets say i do a order by clause in my sql query to sort the columns needed to join and now pass this data through the sort stage where i partition the data based on joining columns and select dont sort previously sorted option for all the key columns.
but when i pass this data now through join stage and select same partitioning and not select any of sort option , will the join stage still insert the hash partitioner and tsort operators ?
but since i had a similar doubt and i was going through this post , i'm doing so ...
Lets say i do a order by clause in my sql query to sort the columns needed to join and now pass this data through the sort stage where i partition the data based on joining columns and select dont sort previously sorted option for all the key columns.
but when i pass this data now through join stage and select same partitioning and not select any of sort option , will the join stage still insert the hash partitioner and tsort operators ?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Unless there is explicit sorting on its inputs (namely a Sort stage or input link sort), the Join stage will have tsort operators inserted. Whether or not there is sorting does not affect whether partitioners are inserted; this is usually driven by whether or not the partitioning algorithm is set to Auto. Finally, because inputs to Join stage typically arrive at different rates, buffer operators are usually also inserted in Join stage inputs to avoid deadlock situations.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Now as i have in my example have 2 sort stages before the join stages (one for each input link to join stage) my understanding is join stage will not insert any tsort operators.
I'm sorry if i'm being really difficult in understanding.
If so then having explicit sort stages before the join stage would be better in temrs of performance as we would avoid the tsort operators being inserted?
Your advice and patience is truly appreciated
I'm sorry if i'm being really difficult in understanding.
If so then having explicit sort stages before the join stage would be better in temrs of performance as we would avoid the tsort operators being inserted?
Your advice and patience is truly appreciated
Yes, I prefer putting in an explicit sort stage before the join and explicitly sorting or stating "don't sort, already sorted" for each join key.
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact: