Doubt on partitioning

varsha16785 · Post by **varsha16785** » Fri May 17, 2013 2:24 am

I was reading this about entire partition :

"You might need to ensure that your lookup tables have been partitioned using the Entire method, so that the lookup tables will always contain the full set of data that might need to be looked up."

... which got me thinking, "Why?" I mean .. even if I partition my data on hash partitioning (or anyother partioning for that matter), DS will still have access to all data. It can still perform the look up ...Will it not?

jerome_rajan · Post by **jerome_rajan** » Fri May 17, 2013 2:45 am

Partitioning is not a prerequisite for the lookup stage. If you do hash partition the data on the reference link, then your data in the stream link would also have to be hash partitioned. Ideally you would use a lookup stage when your reference data is not very huge. But chances are that your input data is very large in which case your 'Hash Partitioning' strategy will become a bottleneck in the job. Since the reference data is not very huge, you would do well to leave the stream data to be partitioned the way datastage deems best (mostly Round Robin or Same) and do an entire partitioning on the reference link which would ensure that all the data is available to be looked up no matter how your input data is partitioned.

jwiles · Post by **jwiles** » Fri May 17, 2013 9:18 am

If the reference data is NOT Entire partitioned, then only certain reference records will be in certain partitions--this is the essence of partitioning--and the input dataset will need to be identically partitioned in order to match data together. The same is true of the Join and Merge stages. By using Entire partitioning on the reference dataset, you remove the requirement to repartition the input dataset as all reference records are available in all partitions. In earlier releases of DataStage, there were limitations on some platforms which prevented the use of Entire partition with extremely large reference datasets. In those situations, partitioning the reference and input datasets would become necessary. This rarely happens with current releases of DataStage and Information Server.

Do not come out of this thinking that Hash Partitioning is a bottleneck producer...it is actually efficient but obviously uses more cpu cycles than not repartitioning (the same can be said of ANY partitioning method). Poorly-chosen/designed partitioning strategies can be a bottleneck producer.

Regards,

priyadarshikunal · Post by **priyadarshikunal** » Fri May 17, 2013 10:22 am

Additional CPU cycles can be compensated with less IO on hash partitioned lookup, although it will be less as its loaded in memory. In Cluster or Grid i think key based partitioning should be considered rather than Entire. I would not say hash partitioning is bottleneck producer. James did explain it well in his post.