Hash partitioning on the same subset key

Posted: Thu Sep 15, 2011 3:41 am
by adasgupta123
Hi,

In my job, two stages, Remove Duplicates and Join, are placed next to each other, with Remove Duplicates feeding the Join.

I have used key-based hash partitioning for the first stage (Remove Duplicates). The key for the first stage is columns A and B. For the Join stage that follows, the key is B.

My question: do I need to repartition the data in the Join stage on column B, or can I use "Same" partitioning in the Join stage, since the data is already hash-partitioned in the previous stage on columns A and B, and B is a subset of (A, B)?

Thanks and Regards

Avik Dasgupta

Re: Hash partitioning on the same subset key

Posted: Thu Sep 15, 2011 4:35 am
by BI-RMA
Hi adasgupta123,

You could only use Same partitioning if the second input stream to your Join stage also contained column A and was also hash-partitioned on columns A and B. But then you could just as well use A and B as the join key.

Since your second stream probably does not have column A, you will have to repartition stream 1 on column B so that identical values of B land in the same partition for both streams.
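
To illustrate why the repartition is needed, here is a minimal sketch in plain Python (not DataStage code). The four-partition layout, the sample rows, and the `partition_of` function (a toy stand-in for the real hash partitioner) are assumptions made up for this example. It joins on B only between rows that sit in the same partition, once with stream 1 left on its (A, B) partitioning and once after repartitioning stream 1 on B.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed number of partitions, for illustration only


def partition_of(key_values, num_partitions=NUM_PARTITIONS):
    """Toy stand-in for a hash partitioner: equal key values give the same partition."""
    return sum(key_values) % num_partitions


def route(rows, keys):
    """Distribute rows to partitions based on the given key columns."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of([row[k] for k in keys])].append(row)
    return dict(parts)


def partitionwise_join(left_parts, right_parts):
    """Join on B, but only between rows located in the same partition."""
    matches = []
    for p, left_rows in left_parts.items():
        for left in left_rows:
            for right in right_parts.get(p, []):
                if left["B"] == right["B"]:
                    matches.append((left["A"], left["B"]))
    return sorted(matches)


# Stream 1 has columns A and B; stream 2 has only B (hypothetical sample data).
stream1 = [{"A": 1, "B": 10}, {"A": 2, "B": 10}, {"A": 3, "B": 11}]
stream2 = [{"B": 10}, {"B": 11}]

right = route(stream2, ["B"])  # stream 2 can only be partitioned on B

# "Same": stream 1 keeps the (A, B) partitioning from Remove Duplicates.
same_partitioning = partitionwise_join(route(stream1, ["A", "B"]), right)
# Repartitioned: stream 1 is hash-partitioned again, this time on B only.
repartitioned = partitionwise_join(route(stream1, ["B"]), right)

print("Same partitioning:", same_partitioning)
print("Repartitioned on B:", repartitioned)
```

With this toy data the Same case finds no matches at all, because each stream 1 row sits in a different partition from the stream 2 row with the same B, while the repartitioned case finds all three.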

Posted: Thu Sep 15, 2011 5:45 am
by adasgupta123
Hi Roland,

Thanks for your explanation. I got your point.

There is another, similar scenario; the only difference is that the second stage is an Aggregator stage. That is, Remove Duplicates is followed directly by the Aggregator. The key for the first stage is columns A and B, and for the second stage it is B. The first stage is hash-partitioned on A and B. Since the second stage (Aggregator) has a single input link and there is no matching operation like a join, I think we can use Same partitioning for the second stage. Please correct me if I am wrong. Looking forward to your advice.

Thanking you

Avik

Posted: Thu Sep 15, 2011 5:51 am
by BI-RMA
Hi Avik,

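Not quite. Hash partitioning on A and B only guarantees that rows with identical (A, B) pairs are in the same partition; rows with the same value of B but different values of A can still end up in different partitions. With Same partitioning the Aggregator would then produce several partial groups for one value of B. Either repartition on B before the Aggregator, or hash-partition on B alone ahead of Remove Duplicates: rows with identical A and B also share B, so they still meet in one partition, and the Aggregator can then use Same.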
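
A small simulation makes it easy to check whether rows sharing a value of B really stay together when the upstream hash key is (A, B). This is plain Python, not DataStage; the partition count, the sample rows, and the toy `partition_of` function are assumptions for illustration only. It counts rows per B inside each partition, the way a parallel aggregator works on its own partition.

```python
from collections import Counter, defaultdict

NUM_PARTITIONS = 4  # assumed number of partitions, for illustration only


def partition_of(key_values, num_partitions=NUM_PARTITIONS):
    """Toy stand-in for a hash partitioner: equal key values give the same partition."""
    return sum(key_values) % num_partitions


def aggregate_per_partition(rows, partition_keys):
    """Count rows per B within each partition; emit one row per (partition, B) group."""
    counts = defaultdict(Counter)
    for row in rows:
        p = partition_of([row[k] for k in partition_keys])
        counts[p][row["B"]] += 1
    return sorted((b, n) for per_b in counts.values() for b, n in per_b.items())


# Hypothetical input, already free of duplicate (A, B) pairs.
rows = [
    {"A": 1, "B": 10},
    {"A": 2, "B": 10},
    {"A": 3, "B": 10},
    {"A": 1, "B": 11},
]

on_ab = aggregate_per_partition(rows, ["A", "B"])  # Same after hashing on (A, B)
on_b = aggregate_per_partition(rows, ["B"])        # hashed on B only

print("Partitioned on (A, B):", on_ab)
print("Partitioned on B:     ", on_b)
```

With this toy data, hashing on (A, B) leaves B = 10 spread over three partitions, so the per-partition aggregation emits three partial rows for it. Hashing on B alone gives exactly one row per value of B.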

Posted: Thu Sep 15, 2011 5:59 am
by adasgupta123
Hi Roland,

Thanks a lot

Regards

Avik