Join stage partition

kpsita · Post by **kpsita** » Thu Jul 14, 2011 11:00 am

Hi,

I have a question about partitioning in the Join stage. My job joins two datasets. Should I hash partition the inputs to the Join stage? When we join two database stages, the Join stage waits until all the records have been read from the tables, so we get correct results. Is this also the case when joining two datasets?

Thanks

jhmckeever · Post by **jhmckeever** » Thu Jul 14, 2011 6:31 pm

... when we join two database stages the join stage will wait till all the records are read from the table ...

This isn't true, unless you've got a sort somewhere in your job. The join will operate in a 'pipeline' fashion, regardless of whether its source data are provided by a database or dataset stage.

jim.paradies · Post by **jim.paradies** » Fri Jul 15, 2011 8:32 am

This isn't true, unless you've got a sort somewhere in your job. The join will operate in a 'pipeline' fashion, regardless of whether its source data are provided by a database or dataset stage.

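If the data streams being joined are not pre-sorted on the join key and the Auto partitioning method is used, a tsort operator will be inserted on the input links. A sort has to read all of its input before it can emit anything, so in that case the join does effectively wait for the full input. This is why a Sort stage is sometimes placed upstream in "Don't Sort (Previously Sorted)" mode: it does no sorting, it simply tells the framework the data is already sorted so that it is not sorted again.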
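
To illustrate why the join wants its inputs sorted on the key, here is a minimal Python sketch of a merge join over two pre-sorted streams. It is not DataStage code and not how the Join stage is implemented internally; `merge_join` and the sample rows are hypothetical names made up for the example. The point is only that once both inputs are sorted, matches can be emitted as the rows stream past, without holding either input in full.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key=itemgetter(0)):
    """Inner-join two iterables that are ALREADY sorted on `key`.

    Rows are consumed lazily, so output is produced while input is
    still arriving (pipeline behaviour). Only the right-hand rows for
    the current key are buffered.
    """
    left_groups = groupby(left, key)
    right_groups = groupby(right, key)
    lk, lrows = next(left_groups, (None, None))
    rk, rrows = next(right_groups, (None, None))

    while lrows is not None and rrows is not None:
        if lk < rk:
            # Left key has no partner yet: advance the left stream.
            lk, lrows = next(left_groups, (None, None))
        elif lk > rk:
            # Right key has no partner yet: advance the right stream.
            rk, rrows = next(right_groups, (None, None))
        else:
            # Keys match: pair every left row with every right row.
            right_buf = list(rrows)
            for l in lrows:
                for r in right_buf:
                    yield l, r
            lk, lrows = next(left_groups, (None, None))
            rk, rrows = next(right_groups, (None, None))


# Hypothetical sample data, sorted on the first column (the join key).
orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
customers = [(2, "x"), (3, "y"), (4, "z")]
joined = list(merge_join(orders, customers))
```

If either input were not sorted on the key, the comparisons above would skip valid matches, which is why a sort gets inserted when the framework cannot tell that the data is already in order.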

As to whether you need to partition, if you leave the partitioning method as auto, it should take care of itself.
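
For the partitioning half of the question, a rough Python sketch of what hash partitioning on the join key buys you. Again, this is an illustration under assumed names (`hash_partition`, `local_join`, `partitioned_join` are all hypothetical), not DataStage's own code: both inputs are split with the same hash on the same key, so rows with equal keys always land in the same partition and each partition can be joined on its own.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, num_partitions, key_index=0):
    """Assign each row to a partition based on a hash of its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used only because it is stable across runs.
        digest = zlib.crc32(repr(row[key_index]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


def local_join(left, right, key_index=0):
    """Plain inner join of two row lists on one key column."""
    index = defaultdict(list)
    for r in right:
        index[r[key_index]].append(r)
    return [(l, r) for l in left for r in index.get(l[key_index], [])]


def partitioned_join(left, right, num_partitions):
    """Partition both inputs identically, then join partition by partition."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(local_join(lp, rp))
    return out


# Hypothetical sample data; the join key is the first column.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d"), (7, "e")]
right_rows = [(2, "x"), (3, "y"), (4, "z"), (7, "w")]

parallel_result = sorted(partitioned_join(left_rows, right_rows, 4))
serial_result = sorted(local_join(left_rows, right_rows))
```

The partitioned result matches the single-partition result because matching keys are never split across partitions. If the two inputs were partitioned differently (for example round robin on one side), a key could end up in different partitions on each link and those matches would be lost.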