Partitioning in DataStage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pravin1581
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Partitioning in DataStage

Post by pravin1581 »

Hi All,

I have some doubts regarding partitioning. Going through the different posts as well as the PDF, one thing I noticed is that for a Lookup the recommended partitioning method is Entire on the reference link and Hash on the primary. My question: if the volume of data is large, won't Entire on the reference link affect performance? Why do we rely only on Entire, and why this differential treatment in the case of Lookup only? The same logic should then apply to Join and Merge as well, since if the partitioning is not correct the result may be erroneous.

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I think Entire is only recommended because it's idiot-proof.

The alternative for a Lookup stage is to guarantee that both/all inputs are identically partitioned based upon the "join" key(s).

For Join and Merge stages they must also be sorted on these keys.

For Merge stages with more than one update input, they must also be de-duplicated based on these keys.
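The "identically partitioned on the join keys" alternative above can be sketched in plain Python (this is illustrative only, not DataStage code; the node count and sample data are made up). Hashing both inputs on the join key guarantees matching rows land on the same node, so each partition can be matched independently:

```python
# Illustrative sketch of hash co-partitioning (assumed 4-node config).
NODES = 4

def hash_partition(rows, key):
    """Assign each row to a partition by hashing its join key."""
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[hash(row[key]) % NODES].append(row)
    return parts

primary = [{"id": i, "val": f"p{i}"} for i in range(8)]
reference = [{"id": i, "desc": f"r{i}"} for i in range(8)]

# Both inputs partitioned on the same key with the same hash function.
p_parts = hash_partition(primary, "id")
r_parts = hash_partition(reference, "id")

# Each node matches only its own partitions -- no cross-node traffic.
joined = []
for p, r in zip(p_parts, r_parts):
    lookup = {row["id"]: row for row in r}
    for row in p:
        if row["id"] in lookup:
            joined.append({**row, **lookup[row["id"]]})

assert len(joined) == 8  # every primary row found its match locally
```

Because the same hash function and key are used on both links, no node ever needs a reference row held by another node.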
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pravin1581
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

ray.wurlod wrote:I think Entire is only recommended because it's idiot-proof.

The alternative for a Lookup stage is to guarantee that both/all inputs are identically partitioned based upon the "join" key(s).

For Join and Merge stages they must also be sorted on these keys.

For Merge stages with more than one update input, they must also be de-duplicated based on these keys.
So Entire is not the preferred way of doing it, even though the PDF suggests it.
gbusson
Participant
Posts: 98
Joined: Fri Oct 07, 2005 2:50 am
Location: France
Contact:

Post by gbusson »

If you're running on an SMP, Entire means the reference data is held in shared memory for the lookup, rather than physically copied per node.

It is not so bad.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Keep in mind that lookups are 'joins in memory' whereas Joins and Merges use scratch space, just like a large join in a database.

If you want speed and have smaller volumes, then use the Lookup and the 'join' will be done in memory. Using a Lookup also means that you don't have to repartition your main data stream to match the join - this is an extra step that can be avoided, and one that uses I/O. Any place you can reduce I/O translates into better performance.

The entire partitioning method guarantees that all your reference data will match to anything in your source data without caring what node it is on or what partitioning method was used with your source data.
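A minimal sketch of that Entire behavior (assumed node count and sample data, not DataStage code): the full reference set is copied to every node, so a source row can be looked up locally no matter how the source stream happens to be partitioned:

```python
# Illustrative sketch of Entire partitioning on a reference link.
NODES = 4
reference = {i: f"desc{i}" for i in range(10)}

# Entire: every node gets its own complete copy of the reference data.
ref_on_node = [dict(reference) for _ in range(NODES)]

# Source rows partitioned arbitrarily (round-robin here) -- it doesn't
# matter, because each node already holds every reference key.
source_parts = [[] for _ in range(NODES)]
for i, row_id in enumerate([7, 2, 9, 0, 5, 3]):
    source_parts[i % NODES].append(row_id)

results = []
for node in range(NODES):
    for row_id in source_parts[node]:
        results.append((row_id, ref_on_node[node][row_id]))

assert len(results) == 6  # every source row matched, no repartitioning
```

The cost is the replication itself: with large reference volumes, every node pays the memory for a full copy, which is exactly the concern raised in the original question.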

On the other hand, the Join and Merge are more scalable and can handle much higher volumes of data because they DON'T do it in memory. They use scratch space for temporary storage and are therefore not limited to the amount of memory allocated per node, but rather to the size of your file system. I would generally assume the filesystem has more space than your memory :).

With joins you have to ensure that both your input data streams are partitioned identically on identically named/datatyped columns. In this case, you would normally NOT use Entire partitioning because then you could end up with a cartesian product. You would be more likely to use hash partitioning so that your join keys are always on the same nodes.
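The duplication risk can be shown with a tiny sketch (illustrative only; the node count and rows are made up). If BOTH of a join's inputs are replicated to every node, each logical match is produced once per node, inflating the output:

```python
# Why Entire on both join inputs is dangerous: duplicated matches.
NODES = 4
left = [{"id": 1, "val": "a"}]
right = [{"id": 1, "desc": "x"}]

# Entire on both inputs: every node sees every row of both streams.
output = []
for node in range(NODES):
    for l in left:
        for r in right:
            if l["id"] == r["id"]:
                output.append({**l, **r})

assert len(output) == NODES  # 4 copies of one logical match -- wrong
```

With hash partitioning on both streams instead, each row lives on exactly one node and the match is emitted exactly once.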

So, Lookups are for low volumes, work in memory and use Entire on the reference data to avoid repartitioning your main data stream. Joins are for higher volumes, use filesystem 'scratch' space and use hash partitioning on both input streams to guarantee efficient matching on each node.


Hope this helps.

Brad.