Partitioning in DataStage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pravin1581
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Partitioning in DataStage

Post by pravin1581 »

Hi All,

I have some doubts regarding partitioning. Going through the different posts as well as the PDF, one thing I noticed is that for a Lookup the recommended partitioning method is Entire on the reference link and Hash on the primary. My question: if the volume of data is large, won't Entire on the reference link affect performance? Why do we rely only on Entire, and why this differential treatment in the case of Lookup only? The same logic should then apply to Join and Merge as well, since if the partitioning is not correct the result may be erroneous.

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I think Entire is only recommended because it's idiot-proof.

The alternative for a Lookup stage is to guarantee that both/all inputs are identically partitioned based upon the "join" key(s).

For Join and Merge stages they must also be sorted on these keys.

For Merge stages with more than one update input, they must also be de-duplicated based on these keys.
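The "identically partitioned on the join keys" alternative above can be sketched in plain Python (this is illustrative only, not DataStage code; the node count and sample data are made up). Hashing both inputs on the join key guarantees matching rows land on the same node, so each partition can be matched independently:

```python
# Illustrative sketch of hash co-partitioning (assumed 4-node config).
NODES = 4

def hash_partition(rows, key):
    """Assign each row to a partition by hashing its join key."""
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[hash(row[key]) % NODES].append(row)
    return parts

primary = [{"id": i, "val": f"p{i}"} for i in range(8)]
reference = [{"id": i, "desc": f"r{i}"} for i in range(8)]

# Both inputs partitioned on the same key with the same hash function.
p_parts = hash_partition(primary, "id")
r_parts = hash_partition(reference, "id")

# Each node matches only its own partitions -- no cross-node traffic.
joined = []
for p, r in zip(p_parts, r_parts):
    lookup = {row["id"]: row for row in r}
    for row in p:
        if row["id"] in lookup:
            joined.append({**row, **lookup[row["id"]]})

assert len(joined) == 8  # every primary row found its match locally
```

Because the same hash function and key are used on both links, no node ever needs a reference row held by another node.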
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pravin1581
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

ray.wurlod wrote:I think Entire is only recommended because it's idiot-proof.

The alternative for a Lookup stage is to guarantee that both/all inputs are identically partitioned based upon the "join" key(s).

For Join and Merge stages they must also be sorted on these keys.

For Merge stages with more than one update input, they must also be de-duplicated based on these keys.
So Entire is not the preferred way of doing it, even though the PDF suggests it.
gbusson
Participant
Posts: 98
Joined: Fri Oct 07, 2005 2:50 am
Location: France
Contact:

Post by gbusson »

If you're running on an SMP, Entire means the reference data is held in shared memory for the lookup, rather than physically copied per node.

It is not so bad.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Keep in mind that lookups are 'joins in memory' whereas Joins and Merges use scratch space, just like a large join in a database.

If you want speed and have smaller volumes, then use the Lookup and the 'join' will be done in memory. Using a Lookup also means that you don't have to repartition your main data stream to match the join - this is an extra step that can be avoided, and one that uses I/O. Any place you can reduce I/O translates into better performance.

The entire partitioning method guarantees that all your reference data will match to anything in your source data without caring what node it is on or what partitioning method was used with your source data.
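A minimal sketch of that Entire behavior (assumed node count and sample data, not DataStage code): the full reference set is copied to every node, so a source row can be looked up locally no matter how the source stream happens to be partitioned:

```python
# Illustrative sketch of Entire partitioning on a reference link.
NODES = 4
reference = {i: f"desc{i}" for i in range(10)}

# Entire: every node gets its own complete copy of the reference data.
ref_on_node = [dict(reference) for _ in range(NODES)]

# Source rows partitioned arbitrarily (round-robin here) -- it doesn't
# matter, because each node already holds every reference key.
source_parts = [[] for _ in range(NODES)]
for i, row_id in enumerate([7, 2, 9, 0, 5, 3]):
    source_parts[i % NODES].append(row_id)

results = []
for node in range(NODES):
    for row_id in source_parts[node]:
        results.append((row_id, ref_on_node[node][row_id]))

assert len(results) == 6  # every source row matched, no repartitioning
```

The cost is the replication itself: with large reference volumes, every node pays the memory for a full copy, which is exactly the concern raised in the original question.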

On the other hand, the Join and Merge are more scalable and can handle much higher volumes of data because they DON'T do it in memory. They use scratch space for temporary storage and are therefore not limited to the amount of memory allocated per node, but rather to the size of your file system. I would generally assume the filesystem has more space than your memory :).

With joins you have to ensure that both your input data streams are partitioned identically on identically named/datatyped columns. In this case, you would normally NOT use Entire partitioning because then you could end up with a cartesian product. You would be more likely to use hash partitioning so that your join keys are always on the same nodes.
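The duplication risk can be shown with a tiny sketch (illustrative only; the node count and rows are made up). If BOTH of a join's inputs are replicated to every node, each logical match is produced once per node, inflating the output:

```python
# Why Entire on both join inputs is dangerous: duplicated matches.
NODES = 4
left = [{"id": 1, "val": "a"}]
right = [{"id": 1, "desc": "x"}]

# Entire on both inputs: every node sees every row of both streams.
output = []
for node in range(NODES):
    for l in left:
        for r in right:
            if l["id"] == r["id"]:
                output.append({**l, **r})

assert len(output) == NODES  # 4 copies of one logical match -- wrong
```

With hash partitioning on both streams instead, each row lives on exactly one node and the match is emitted exactly once.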

So, Lookups are for low volumes, work in memory and use Entire on the reference data to avoid repartitioning your main data stream. Joins are for higher volumes, use filesystem 'scratch' space and use hash partitioning on both input streams to guarantee efficient matching on each node.


Hope this helps.

Brad.