suggest which typr of partition i use in PX job..

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
birajudesai
Participant
Posts: 1
Joined: Wed Oct 25, 2006 7:05 pm

suggest which typr of partition i use in PX job..

Post by birajudesai »

Hello friend..
I have job in which i join 2 tables in PX job..
which partition type i use..
thanks,...
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

Hash partition works best while performing a join as its key based...which makes it efficient, accurate and fast...

I would suggest you read below thread, it's worth looking for if you want to know more about joins..
viewtopic.php?t=104814
Shantanu Choudhary
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If the keys are integer sequences, then Modulus becomes a candidate for most efficient partitioning algorithm.

If the two tables are in the same instance of the same database, a join performed in the database will almost certainly be more efficient than using any of the DataStage stages - the latter approach has to import all rows from both tables (subject to any other WHERE criteria) into virtual Data Sets before performing the combination of rows; if you do it in the database only the result of the join needs to be processed through DataStage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
samsuf2002
Premium Member
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennesse

Post by samsuf2002 »

hash partitioning would be a good choice
hi sam here
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

... but modulus partitioning may be a better choice, as it consumes fewer resources.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
helpmeout
Participant
Posts: 23
Joined: Thu Nov 02, 2006 2:31 pm

Post by helpmeout »

Is there rule that the joining key and partition key must be same in joiner stage..and what about differences among joiner,merger,lookup stages..
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

Modulous may be using simpler method of calculation and may not be taking too many resource..But number of partition is totally dependent on number of nodes, number of partition will always be less than or equal to number of nodes.

assuming you have two nodes and in the incoming record 90% of the record are even number. Then you have one partition with 90% record and second partition with only 10%.

Even hash has similar problem. If you have 1 field as key and 90% record has same value all of them will be written to same partition. In such situation having multiple key is advised. The situation is rare, but it do happen.

That is why, I use Hash most of the time and recommend the same.
Shantanu Choudhary
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Your scenario will not apply in an integer sequence which is what I specified.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

tagnihotri wrote:I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...
As i mentioned before, modulous partition is totally dependent on number of nodes; number of partition will always be less than or equal to number of nodes. If you see the algorithm for Modulous, you will see that partition is also data dependent and after partitioning the data the distribution of data in each partition may vary.

if you have modulous partition on an integer field having data in "SEQUENCE", then yes I agree modulous would be a good candidate.

But if you are not aware of the distribution of data for the partition key or its random integer value. Then think and do lot of research before deciding on Modulous partioning.
Shantanu Choudhary
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hash partitioning is also totally dependent upon the number of nodes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

ray.wurlod wrote:Hash partitioning is also totally dependent upon the number of nodes.
I don't think so.
For Hash partition hash function is used and in hash function, I don't think nodes are used. The number of partition is more data dependent. A generic definition of hash: The result of the hash algorithm, is a number between 0 and 1023, which is applied to the partitioning map to determine the node on which the record reside.

Whereas for modulous the data is assigned to a partition as per below formula:
partition_number =
Numeric field in the input
mod
number of processing nodes on which the partitioner executes
.
So, dependency is on nodes and data. There is a possibility that all the nodes are not utilized.
Shantanu Choudhary
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hash partitioning derives partitition number as:
partition = Mod(hashvalue(key),@NUMPARTITIONS)

hashvalue() is a function that generates an int32 result.

Therefore, hash partitioning is totally dependent on the number of partitions, as I previously stated.

Hashing is nothing more than a mechanism for choosing one from among a finite number of alternatives - in this case the number of available partitions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tagnihotri
Participant
Posts: 83
Joined: Sat Oct 28, 2006 6:25 am

Post by tagnihotri »

Hash paritioning also depends on number of nodes. The algo actualy sets the buckets depending on the number of nodes specified in the configuration file. Same as modulus :-)
talk2shaanc wrote:
tagnihotri wrote:I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...
As i mentioned before, modulous partition is totally dependent on number of nodes; number of partition will always be less than or equal to number of nodes. If you see the algorithm for Modulous, you will see that partition is also data dependent and after partitioning the data the distribution of data in each partition may vary.

if you have modulous partition on an integer field having data in "SEQUENCE", then yes I agree modulous would be a good candidate.

But if you are not aware of the distribution of data for the partition key or its random integer value. Then think and do lot of research before deciding on Modulous partioning.
Post Reply