Hi,
In my job, two stages, Remove Duplicates and Join, are placed back to back.
I have used key-based hash partitioning for the first stage (Remove Duplicates). The key for the first stage is columns A and B. For the following Join stage, the key is B.
My question: do I need to repartition the data on column B in the Join stage, or can I use "Same" partitioning there, since the data was already hash-partitioned on columns A and B in the previous stage and B is a subset of (A, B)?
Thanks and Regards
Avik Dasgupta
Hash partitioning on the same subset key
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 42
- Joined: Fri Oct 20, 2006 1:58 am
Re: Hash partitioning on the same subset key
Hi adasgupta123,
You could use Same partitioning only if the second input stream to your Join stage also contained column A and was also hash-partitioned on columns A and B. But in that case you could just as well keep A and B as the join key.
Since your second stream probably does not have column A, you will have to repartition stream 1 on column B so that identical values of B land in the same partition for both streams.
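To make the point concrete, here is a minimal Python sketch (not DataStage code). It assumes 4 partitions and uses a toy polynomial hash as a stand-in for the real partitioner; `toy_hash_partition`, `partitions_per_b`, and the sample values are hypothetical names and data chosen only for illustration.

```python
NUM_PARTITIONS = 4  # assumed degree of parallelism, for illustration only

def toy_hash_partition(row, keys, n=NUM_PARTITIONS):
    """Toy stand-in for a hash partitioner: fold the key columns into one integer."""
    h = 0
    for k in keys:
        h = h * 31 + row[k]
    return h % n

def partitions_per_b(rows, keys):
    """Map each value of B to the set of partitions its rows land in."""
    result = {}
    for row in rows:
        result.setdefault(row["B"], set()).add(toy_hash_partition(row, keys))
    return result

# Stream 1 carries columns A and B; stream 2 carries only B.
stream1 = [{"A": a, "B": b} for a in range(1, 5) for b in (10, 20)]
stream2 = [{"B": 10}, {"B": 20}]

by_ab = partitions_per_b(stream1, ["A", "B"])  # partitioning inherited via Same
by_b = partitions_per_b(stream1, ["B"])        # stream 1 repartitioned on B
right = partitions_per_b(stream2, ["B"])       # stream 2 partitioned on B

print("stream 1 hashed on A,B:", by_ab)
print("stream 1 hashed on B:  ", by_b)
print("stream 2 hashed on B:  ", right)
```

With this toy hash, rows of stream 1 that share a value of B are spread over several partitions when the hash includes A, while stream 2 keeps each B in exactly one partition. Only after stream 1 is rehashed on B alone do matching rows from both streams end up together.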
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
Hi Roland,
Thanks for your explanation. I got your point.
There is another similar scenario; the only difference is that the second stage is an Aggregator stage. That is, the Remove Duplicates and Aggregator stages are placed back to back. The key for the first stage is columns A and B, and the key for the second stage is B. The first stage is hash-partitioned on A and B. I think that, since the second stage (Aggregator) has a single input link and there is no matching operation like a join, we can go ahead with Same partitioning for the second stage. Please correct me if I am wrong. Looking forward to your advice.
Thanking you
Avik
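One way to check this assumption is a small simulation. The Python sketch below is not DataStage code: it assumes 4 partitions, a toy polynomial hash in place of the real partitioner, and an aggregation that simply counts rows per B inside each partition. `toy_hash_partition`, `parallel_count_by_b`, and the sample rows are hypothetical.

```python
from collections import Counter

NUM_PARTITIONS = 4  # assumed degree of parallelism, for illustration only

def toy_hash_partition(row, keys, n=NUM_PARTITIONS):
    """Toy stand-in for a hash partitioner: fold the key columns into one integer."""
    h = 0
    for k in keys:
        h = h * 31 + row[k]
    return h % n

def parallel_count_by_b(rows, partition_keys):
    """Partition the rows, then count rows per B inside each partition only."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[toy_hash_partition(row, partition_keys)].append(row)
    output = []
    for part in partitions:
        # Each partition aggregates independently and never sees the others.
        counts = Counter(row["B"] for row in part)
        output.extend(sorted(counts.items()))
    return output

rows = [{"A": a, "B": b} for a in range(1, 5) for b in (10, 20)]

same = parallel_count_by_b(rows, ["A", "B"])  # Same partitioning kept from stage 1
rehashed = parallel_count_by_b(rows, ["B"])   # repartitioned on the grouping key

print("kept A,B partitioning:", same)
print("rehashed on B:        ", rehashed)
```

In this sketch, keeping the (A, B) partitioning yields several partial groups per value of B, one from each partition that holds some of its rows. Rehashing on B yields a single group per value. This suggests a single input link does not by itself make Same partitioning safe when the grouping key is only B.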