Join stage partition

Posted: Thu Jul 14, 2011 11:00 am
by kpsita
Hi,

I have a question about partitioning in the Join stage. My job design joins two datasets. Should I hash partition the inputs in the Join stage? When we join two database stages, the Join stage waits until all the records are read from the tables, so we get correct results. Is this also the case when joining two datasets?

Thanks

Posted: Thu Jul 14, 2011 6:31 pm
by jhmckeever
... when we join two database stages the join stage will wait till all the records are read from the table ...
This isn't true, unless you've got a sort somewhere in your job. The join will operate in a 'pipeline' fashion, regardless of whether its source data are provided by a database or dataset stage.

Posted: Fri Jul 15, 2011 8:32 am
by jim.paradies
This isn't true, unless you've got a sort somewhere in your job. The join will operate in a 'pipeline' fashion, regardless of whether its source data are provided by a database or dataset stage.
Joining data streams that are not pre-sorted on the join key will cause a tsort operator to be inserted on the input links if the auto partitioning method is used. In fact, the Sort stage is sometimes used in a "Don't sort" mode simply to mark the data as already sorted and so avoid re-sorting.
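
To illustrate why the join needs its inputs sorted on the key, here is a rough Python sketch of a streaming merge join. It is not DataStage code and says nothing about how the engine is implemented; `merge_join` and the sample rows are made up for illustration. Each row is assumed to be a tuple with the join key in position 0.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key=itemgetter(0)):
    """Inner join of two row streams that are ALREADY sorted on the key.

    Rows are consumed one key group at a time, so output can start
    before either input has been read to the end (pipeline behaviour).
    """
    left_groups = groupby(left, key)
    right_groups = groupby(right, key)
    lk, lrows = next(left_groups, (None, None))
    rk, rrows = next(right_groups, (None, None))
    while lrows is not None and rrows is not None:
        if lk < rk:
            # Left key has no partner yet: skip ahead on the left.
            lk, lrows = next(left_groups, (None, None))
        elif lk > rk:
            # Right key has no partner: skip ahead on the right.
            rk, rrows = next(right_groups, (None, None))
        else:
            # Matching key: buffer only this key's right-hand rows.
            rbuf = list(rrows)
            for l in lrows:
                for r in rbuf:
                    yield l + r[1:]
            lk, lrows = next(left_groups, (None, None))
            rk, rrows = next(right_groups, (None, None))


sorted_left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
sorted_right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = list(merge_join(sorted_left, sorted_right))

# Feeding unsorted input silently drops matches, which is why a sort
# has to happen somewhere upstream if the data is not already ordered.
unsorted_left = [(2, "b"), (1, "a")]
broken = list(merge_join(unsorted_left, [(1, "x"), (2, "y")]))
```

With the sorted inputs every matching key is paired up; with the unsorted left input the row for key 1 is lost because the merge has already moved past it on the right.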

As to whether you need to partition, if you leave the partitioning method as auto, it should take care of itself.
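
For what it's worth, the reason key-based partitioning matters for a parallel join can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not DataStage code: the function names are hypothetical, rows are tuples with the join key in position 0, and each "partition" is joined independently, the way separate nodes would do it.

```python
from collections import defaultdict


def hash_partition(rows, n):
    """Send each row to a partition chosen from its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts


def round_robin_partition(rows, n):
    """Deal rows out in turn, ignoring the join key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def join_partitions(left_parts, right_parts):
    """Join partition i of the left only with partition i of the right."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        index = defaultdict(list)
        for r in rp:
            index[r[0]].append(r)
        for l in lp:
            for r in index.get(l[0], []):
                out.append(l + r[1:])
    return out


left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
right = [(4, "w"), (1, "x"), (2, "y"), (3, "z")]

# Same key -> same partition on both inputs, so every match is found.
good = join_partitions(hash_partition(left, 2), hash_partition(right, 2))

# Key-agnostic partitioning can put matching rows on different nodes.
bad = join_partitions(round_robin_partition(left, 2),
                      round_robin_partition(right, 2))
```

In this small example the hash-partitioned join finds all four matches, while the round-robin version finds none, because no matching pair happens to land in the same partition. Whatever method you end up with, both inputs need to be partitioned on the join key in the same way.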