How to decide which partitioning method to use for which kind of job

taral
Participant
Posts: 16
Joined: Fri Mar 26, 2010 1:41 am

Post by taral »

We have different types of partitioning, i.e.
Hash
Entire
Round Robin
Same

How do we decide which partitioning method to use?
srinivas.g
Participant
Posts: 251
Joined: Mon Jun 09, 2008 5:52 am

Post by srinivas.g »

The default is Auto.

Join, Merge --> Hash
Lookup --> Entire
Srinu Gadipudi
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Lookup-->Entire as a blanket statement? Yikes. To the OP - practice, experience and experimentation help.
-craig

"You can never have too many knives" -- Logan Nine Fingers
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

It depends on the type of requirement you have. First decide whether you need keyed or non-keyed partitioning; then, as Craig mentioned, experiment and decide.
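
To make the keyed vs. non-keyed distinction concrete, here is a rough Python sketch (not DataStage code). The node count, the sample rows, and the use of CRC32 as the hash function are all assumptions for illustration; the engine's own hash partitioner is internal and will differ.

```python
import zlib
from collections import defaultdict

NODES = 4  # hypothetical number of processing nodes


def round_robin(rows, nodes=NODES):
    """Keyless: deal rows to nodes in turn, giving an even spread."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def hash_partition(rows, key, nodes=NODES):
    """Keyed: rows with the same key value always land on the same node.

    CRC32 stands in for whatever hash the real engine uses (assumption).
    """
    parts = defaultdict(list)
    for row in rows:
        node = zlib.crc32(str(row[key]).encode()) % nodes
        parts[node].append(row)
    return parts


def nodes_per_key(parts, key):
    """Map each key value to the set of nodes that hold rows with that value."""
    seen = defaultdict(set)
    for node, rows in parts.items():
        for row in rows:
            seen[row[key]].add(node)
    return seen


# Made-up sample data: eight rows, three distinct customer keys.
customers = ["A", "B", "C", "A", "B", "A", "C", "A"]
rows = [{"cust": c, "amt": n} for n, c in enumerate(customers)]

rr = round_robin(rows)
hp = hash_partition(rows, "cust")

print({node: len(p) for node, p in rr.items()})  # even row counts per node
print(dict(nodes_per_key(rr, "cust")))           # a key may span several nodes
print(dict(nodes_per_key(hp, "cust")))           # each key sits on exactly one node
```

Round robin gives the even spread, but customer "A" ends up on several nodes. The keyed version keeps each customer on one node, at the risk of uneven partitions if the key values are skewed.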
Nag
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Welcome aboard.

There are actually eight choices for partitioning algorithm, and four for collecting. However, the decision is usually easier than that.

If you don't need to keep like-valued keys together, use an algorithm that spreads rows as evenly as possible over processing nodes. If you do need to keep like-valued keys together, use a key-based algorithm (modulus for a single integer key, hash otherwise). Range partitioning is rarely used, and requires that you pre-process your data to generate a "range map". Entire for reference input to Lookup stage is handy in that it guarantees that all valid lookups will succeed, but comes at a cost on cluster/grid environments in that all records have to be sent to all nodes (in an SMP environment one copy is lodged in shared memory).
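
As a rough illustration of the Lookup point, the Python sketch below (not DataStage code) counts successful lookups when each node can only see its own partition of the reference data. The node count, sample rows, and helper names are made up for the example, and the modulus function is simply key value mod number of nodes.

```python
NODES = 3  # hypothetical number of processing nodes


def round_robin(rows, nodes=NODES):
    """Keyless: deal rows to nodes in turn."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def modulus_partition(rows, key, nodes=NODES):
    """Keyed, for a single integer key: node = key value mod node count."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[row[key] % nodes].append(row)
    return parts


def entire(rows, nodes=NODES):
    """Entire: every node receives a full copy of the rows."""
    return [list(rows) for _ in range(nodes)]


def lookup_hits(stream_parts, ref_parts, key):
    """Count stream rows that find a match in the reference rows on the same node."""
    hits = 0
    for stream, ref in zip(stream_parts, ref_parts):
        ref_keys = {r[key] for r in ref}
        hits += sum(1 for s in stream if s[key] in ref_keys)
    return hits


# Made-up data: every stream id exists somewhere in the reference set.
reference = [{"id": i, "name": f"item{i}"} for i in range(6)]
stream = [{"id": i} for i in [3, 1, 4, 1, 5, 0, 2, 5]]

stream_rr = round_robin(stream)

hits_entire = lookup_hits(stream_rr, entire(reference), "id")
hits_mismatch = lookup_hits(stream_rr, round_robin(reference), "id")
hits_keyed = lookup_hits(
    modulus_partition(stream, "id"), modulus_partition(reference, "id"), "id"
)
copies_sent = sum(len(p) for p in entire(reference))

print(hits_entire, hits_mismatch, hits_keyed, copies_sent)
```

With Entire, every valid lookup succeeds however the stream is partitioned, but the number of reference rows shipped is multiplied by the node count. Partitioning both inputs identically on the key also finds every match without the extra copies, while mismatched partitioning silently loses matches.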
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.