Hash on Sort Stage
Posted: Thu May 21, 2009 2:03 pm
Hi folks,
I have a job that starts from two datasets, which are going to be joined on two keys, SEQ and DATE. Essentially, I hash-partition and sort them in a Sort stage before they go to the Join stage.
My doubt is this: since I already define SEQ and DATE as keys in both datasets, I am not sure whether the input datasets are already partitioned on those two keys. Can I skip the hash?
Secondly, after the above Join stage we have one more Join stage, based only on key SEQ. It has two input links: one from the previous Join stage, the other from one of the original datasets (I have a Sort and a Copy stage before the input to each join). My question is: do I have to hash and sort each link on SEQ again before this join? (I guess those two links are already sorted and partitioned on SEQ and DATE.)
I just wish to avoid repartitioning and understand the theory better.
I would appreciate any of your thoughts.
Thanks in advance!
YXie
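
To make the second question concrete, here is a minimal Python sketch of key-based hash partitioning. It is not DataStage code, and the CRC32-based hash, the helper names (`partition_of`, `partitions_per_seq`), the 4-partition setup, and the sample rows are all assumptions made for illustration; the engine's real hash function is different. It only shows the general idea: hashing on SEQ alone keeps all rows for one SEQ in a single partition, while hashing on SEQ and DATE together can spread them across several.

```python
import zlib


def partition_of(key_values, num_partitions):
    """Pick a partition from a hash of the key values (illustrative hash only)."""
    encoded = "|".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(encoded) % num_partitions


def partitions_per_seq(rows, key_columns, num_partitions):
    """Map each SEQ value to the set of partitions its rows are sent to."""
    placement = {}
    for row in rows:
        part = partition_of([row[c] for c in key_columns], num_partitions)
        placement.setdefault(row["SEQ"], set()).add(part)
    return placement


# Hypothetical sample data: 5 SEQ values, 10 dates each.
rows = [
    {"SEQ": seq, "DATE": f"2009-05-{day:02d}"}
    for seq in range(1, 6)
    for day in range(1, 11)
]

by_seq = partitions_per_seq(rows, ["SEQ"], 4)
by_seq_date = partitions_per_seq(rows, ["SEQ", "DATE"], 4)

for seq in sorted(by_seq):
    print(seq, "hash(SEQ):", sorted(by_seq[seq]),
          "hash(SEQ, DATE):", sorted(by_seq_date[seq]))
```

In this sketch, data hashed on SEQ and DATE is not guaranteed to be co-located for a later join on SEQ alone, whereas data hashed on SEQ only is co-located for both joins, because rows that match on SEQ and DATE necessarily match on SEQ. Within a partition, rows sorted on SEQ then DATE are also in SEQ order, so the sort itself would carry over.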