Auto partitioning in join/merge stages

NigeGriff · Post by **NigeGriff** » Mon Sep 14, 2009 6:59 am

I have just read the enterprise developers guide section on join and merge stages and it states that if auto partitioning is selected datastage will automatically hash partition on the matching key and then sort the partitions using the matching key.

This feature has a productivity gain associated with as it negates the need to add a sort stage on each of the inputs to the join/merge stage to partition and sort the data.

I am intending to start using this feature going forward but before i do are there any risks/issues with using this feature?

priyadarshikunal · Post by **priyadarshikunal** » Mon Sep 14, 2009 8:25 am

It will work fine unless someone changes the environment variables
APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION (or something like that don't remember the exact name) to true for some specific scenario (remove warnings).

ArndW · Post by **ArndW** » Mon Sep 14, 2009 8:56 am

I don't see a performance gain in this scenario, in fact there is a potential for having to re-sort data. If your join left & right streams are sorted and partitioned on the main join key then resorting and repartitioning isn't necessary. In addition, if the data is sorted on one key and not on another, then you can so specify in a sort stage ("data already sorted") for that one column and gain a lot of time, whereas the implicit sort stage has no such facility and would need to do a costly sort.

NigeGriff · Post by **NigeGriff** » Mon Sep 14, 2009 9:28 am

ArndW,

i was specifically talking about the scenario where the inputs to the join/merge did require partitioning and sorting and the productivity gain being not having to add sort stages on all the inputs because auto would performs these tasks autmoatically 'under the covers'.

ArndW · Post by **ArndW** » Mon Sep 14, 2009 9:43 am

I think I like the idea of using an explicit sort stage as it gives the designer more control (i.e. limiting the sort memory, choosing the type of sort) but with unsorted data that needs repartitioning there is, as you've said, no real difference.

ray.wurlod · Post by **ray.wurlod** » Mon Sep 14, 2009 5:23 pm

DataStage will insert hash partitioner and tsort operators for Join, Merge and certain other stage types, if (Auto) is selected and insertion is not prevented, and if there is no sorting specified on those input links. It doesn't matter that the data have already been sorted upstream; the stage itself has no way of knowing that. That's why I concur with the advice to use an explicit Sort stage, so that you can leverage its "don't sort, previously sorted" capability to truly gain performance advantage.

sshettar · Post by **sshettar** » Tue Sep 15, 2009 2:52 pm

Sorry if i am reopening the topic even if is it is not posted by me .
but since i had a similar doubt and i was going through this post , i'm doing so ...

Lets say i do a order by clause in my sql query to sort the columns needed to join and now pass this data through the sort stage where i partition the data based on joining columns and select dont sort previously sorted option for all the key columns.
but when i pass this data now through join stage and select same partitioning and not select any of sort option , will the join stage still insert the hash partitioner and tsort operators ?

ray.wurlod · Post by **ray.wurlod** » Tue Sep 15, 2009 4:52 pm

Unless there is explicit sorting on its inputs (namely a Sort stage or input link sort), the Join stage will have tsort operators inserted. Whether or not there is sorting does not affect whether partitioners are inserted; this is usually driven by whether or not the partitioning algorithm is set to Auto. Finally, because inputs to Join stage typically arrive at different rates, buffer operators are usually also inserted in Join stage inputs to avoid deadlock situations.

sshettar · Post by **sshettar** » Wed Sep 16, 2009 9:17 am

Now as i have in my example have 2 sort stages before the join stages (one for each input link to join stage) my understanding is join stage will not insert any tsort operators.

I'm sorry if i'm being really difficult in understanding.

If so then having explicit sort stages before the join stage would be better in temrs of performance as we would avoid the tsort operators being inserted?

Your advice and patience is truly appreciated

ArndW · Post by **ArndW** » Wed Sep 16, 2009 9:18 am

Yes, I prefer putting in an explicit sort stage before the join and explicitly sorting or stating "don't sort, already sorted" for each join key.

ray.wurlod · Post by **ray.wurlod** » Wed Sep 16, 2009 5:33 pm

You can inspect the score to determine whether operators are inserted.