Doubt on partitioning


varsha16785
Participant
Posts: 4
Joined: Wed May 15, 2013 1:25 am

Post by varsha16785 »

I was reading this about Entire partitioning:

"You might need to ensure that your lookup tables have been partitioned using the Entire method, so that the lookup tables will always contain the full set of data that might need to be looked up."

... which got me thinking, "Why?" I mean, even if I partition my data using hash partitioning (or any other partitioning method, for that matter), DS will still have access to all the data. It can still perform the lookup, can it not?
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Partitioning is not a prerequisite for the Lookup stage. If you do hash partition the data on the reference link, then the data on the stream link must also be hash partitioned on the same key. Ideally you would use a Lookup stage when your reference data is not very large. Chances are that your input data is very large, in which case hash partitioning it can become a bottleneck in the job. Since the reference data is small, you would do well to leave the stream data partitioned the way DataStage deems best (mostly Round Robin or Same) and use Entire partitioning on the reference link. That ensures all the reference data is available to be looked up no matter how your input data is partitioned.
Jerome
Data Integration Consultant at AWS

Life is really simple, but we insist on making it complicated.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

If the reference data is NOT Entire partitioned, then only certain reference records will be in certain partitions--this is the essence of partitioning--and the input dataset will need to be identically partitioned (same method, same keys) in order to match data together. The same is true of the Join and Merge stages. By using Entire partitioning on the reference dataset, you remove the requirement to repartition the input dataset, as all reference records are available in all partitions. In earlier releases of DataStage, there were limitations on some platforms which prevented the use of Entire partitioning with extremely large reference datasets. In those situations, partitioning both the reference and input datasets would become necessary. This rarely happens with current releases of DataStage and Information Server.
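To make the point above concrete, here is a minimal Python sketch, not DataStage code, that simulates a lookup where each partition can only see its own slice of the reference data. The function names, the CRC32-based hash, the four partitions, and the sample rows are all assumptions made for illustration; the real engine's partitioners work differently internally.

```python
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical degree of parallelism


def hash_partition(rows, key, n=NUM_PARTITIONS):
    """Send each row to the partition chosen by a hash of its key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[crc32(str(row[key]).encode()) % n].append(row)
    return parts


def round_robin_partition(rows, n=NUM_PARTITIONS):
    """Deal rows out to partitions in turn, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def entire_partition(rows, n=NUM_PARTITIONS):
    """Give every partition a full copy of the rows."""
    return [list(rows) for _ in range(n)]


def lookup(stream_parts, ref_parts, key):
    """Look up each stream row against the reference rows in the SAME partition only."""
    matched = missed = 0
    for stream, ref in zip(stream_parts, ref_parts):
        ref_keys = {r[key] for r in ref}
        for row in stream:
            if row[key] in ref_keys:
                matched += 1
            else:
                missed += 1
    return matched, missed


reference = [{"id": i, "name": f"cust_{i}"} for i in range(100)]
stream = [{"id": i % 100, "amount": i} for i in range(1000)]

# Reference hashed, stream round robin: some rows land beside the wrong slice.
mismatched = lookup(round_robin_partition(stream), hash_partition(reference, "id"), "id")
# Reference Entire, stream round robin: every partition sees every reference row.
entire = lookup(round_robin_partition(stream), entire_partition(reference), "id")
# Both hashed on the same key: matching keys always share a partition.
aligned = lookup(hash_partition(stream, "id"), hash_partition(reference, "id"), "id")

print("hash ref + round-robin stream:", mismatched)  # some lookups miss
print("entire ref + round-robin stream:", entire)    # (1000, 0)
print("hash ref + hash stream:", aligned)            # (1000, 0)
```

Every stream key exists in the reference data, so any miss in the first case comes purely from the partitioning mismatch, not from missing data.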

Do not come out of this thinking that hash partitioning is a bottleneck producer. It is actually efficient, but it obviously uses more CPU cycles than not repartitioning at all (the same can be said of ANY partitioning method). Poorly chosen or poorly designed partitioning strategies can be a bottleneck producer.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

The additional CPU cycles can be offset by reduced I/O in a hash-partitioned lookup, although the saving will be small because the reference data is loaded in memory. In a cluster or grid environment, I think key-based partitioning should be considered rather than Entire. I would not say hash partitioning is a bottleneck producer; James explained it well in his post.
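One way to see why key-based partitioning gets considered on a cluster or grid is to count how many reference rows have to be held in total under each method. The sketch below is plain Python arithmetic, not a DataStage API; the function name and the figures are hypothetical, and it counts logical rows only. How the engine physically stores or shares those copies depends on the platform and is not modelled here.

```python
def reference_rows_held(ref_rows, partitions, method):
    """Total reference rows held across all partitions (logical count only)."""
    if method == "entire":
        # Every partition receives a full copy of the reference data.
        return ref_rows * partitions
    if method == "hash":
        # Each reference row lives in exactly one partition.
        return ref_rows
    raise ValueError(f"unknown method: {method}")


# Hypothetical figures: 5 million reference rows, 8 partitions.
for method in ("entire", "hash"):
    print(method, reference_rows_held(5_000_000, 8, method))
```

With Entire, the count grows with the number of partitions; with hash it stays flat, at the cost of having to hash partition the stream link on the same key.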
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink: