Hash on Sort Stage

yxie · Post by **yxie** » Thu May 21, 2009 2:03 pm

Hi folks,

I have a job that starts from two datasets, which are joined on two keys, SEQ and DATE. At the moment I hash-partition and sort both of them in a Sort stage before they reach the Join stage.
My question: I have already defined SEQ and DATE as keys on both datasets, but I am not sure whether the input datasets are already partitioned on those two keys. If they are, can I skip the hash?
Secondly, after the above Join stage there is one more Join stage, keyed only on SEQ. It has two input links: one from the previous Join stage, the other from one of the original datasets (I have a Sort and a Copy stage before the input to each join). Do I have to hash and sort each link on SEQ again before this join? (I assume both links are already sorted and partitioned on SEQ and DATE.)
I just want to avoid repartitioning and understand the theory better.

I'd appreciate any thoughts.

Thanks in advance!

YXie

mikegohl · Post by **mikegohl** » Thu May 21, 2009 4:24 pm

Do you know what the partition and join keys were when the datasets were written? You can partition all the datasets by SEQ from the start. This avoids having to repartition before the second join. You can still sort the data on SEQ and DATE.
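
To illustrate why partitioning on SEQ alone is enough for both joins, here is a minimal Python sketch of a partition-wise join. It is not DataStage code: the `partition` and `partitioned_join` helpers, the `id` column, the sample rows, and the toy hash (sum of integer key values modulo the partition count) are all assumptions made up for the example, and the real hash partitioner works differently. The principle is that a join running independently on each partition only finds matches when rows with equal join keys land in the same partition.

```python
from collections import defaultdict

def partition(rows, key_cols, n_parts):
    """Toy hash partitioner: sum of the integer key values modulo n_parts.
    Illustrative assumption only; not the hash function DataStage uses."""
    parts = defaultdict(list)
    for row in rows:
        parts[sum(row[c] for c in key_cols) % n_parts].append(row)
    return parts

def partitioned_join(left, right, part_cols, join_cols, n_parts=2):
    """Inner-join each partition independently, as a parallel join would."""
    lparts = partition(left, part_cols, n_parts)
    rparts = partition(right, part_cols, n_parts)
    matches = []
    for p in range(n_parts):
        for l in lparts[p]:
            for r in rparts[p]:
                if all(l[c] == r[c] for c in join_cols):
                    matches.append((l["id"], r["id"]))
    return sorted(matches)

# Hypothetical sample rows; DATE is stored as an integer YYYYMMDD.
left = [
    {"id": "L1", "SEQ": 1, "DATE": 20090521},
    {"id": "L2", "SEQ": 2, "DATE": 20090521},
]
right = [
    {"id": "R1", "SEQ": 1, "DATE": 20090521},
    {"id": "R2", "SEQ": 1, "DATE": 20090522},
    {"id": "R3", "SEQ": 2, "DATE": 20090521},
]

# Partitioned on SEQ only: correct for the join on (SEQ, DATE) ...
by_seq_join_both = partitioned_join(left, right, ["SEQ"], ["SEQ", "DATE"])
# ... and also correct for the later join on SEQ alone.
by_seq_join_seq = partitioned_join(left, right, ["SEQ"], ["SEQ"])
# Partitioned on (SEQ, DATE) but joined on SEQ alone: the L1/R2 match is
# lost, because equal SEQ values can end up in different partitions.
by_both_join_seq = partitioned_join(left, right, ["SEQ", "DATE"], ["SEQ"])

print(by_seq_join_both)
print(by_seq_join_seq)
print(by_both_join_seq)
```

In this sketch, data partitioned on SEQ serves both joins, while data partitioned on SEQ and DATE would need repartitioning before the SEQ-only join. Sorting is less of a concern: rows sorted on SEQ, then DATE within a partition are also in SEQ order.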