suggest which typr of partition i use in PX job..

birajudesai · Post by **birajudesai** » Thu Nov 02, 2006 1:12 pm

Hello friend..
I have job in which i join 2 tables in PX job..
which partition type i use..
thanks,...

talk2shaanc · Post by **talk2shaanc** » Thu Nov 02, 2006 1:20 pm

Hash partition works best while performing a join as its key based...which makes it efficient, accurate and fast...

I would suggest you read below thread, it's worth looking for if you want to know more about joins..
viewtopic.php?t=104814

ray.wurlod · Post by **ray.wurlod** » Thu Nov 02, 2006 1:33 pm

If the keys are integer sequences, then Modulus becomes a candidate for most efficient partitioning algorithm.

If the two tables are in the same instance of the same database, a join performed in the database will almost certainly be more efficient than using any of the DataStage stages - the latter approach has to import all rows from both tables (subject to any other WHERE criteria) into virtual Data Sets before performing the combination of rows; if you do it in the database only the result of the join needs to be processed through DataStage.

samsuf2002 · Post by **samsuf2002** » Thu Nov 02, 2006 1:57 pm

hash partitioning would be a good choice

ray.wurlod · Post by **ray.wurlod** » Thu Nov 02, 2006 3:23 pm

... but modulus partitioning may be a better choice, as it consumes fewer resources.

helpmeout · Post by **helpmeout** » Thu Nov 02, 2006 3:51 pm

Is there rule that the joining key and partition key must be same in joiner stage..and what about differences among joiner,merger,lookup stages..

talk2shaanc · Post by **talk2shaanc** » Thu Nov 02, 2006 3:54 pm

Modulous may be using simpler method of calculation and may not be taking too many resource..But number of partition is totally dependent on number of nodes, number of partition will always be less than or equal to number of nodes.

assuming you have two nodes and in the incoming record 90% of the record are even number. Then you have one partition with 90% record and second partition with only 10%.

Even hash has similar problem. If you have 1 field as key and 90% record has same value all of them will be written to same partition. In such situation having multiple key is advised. The situation is rare, but it do happen.

That is why, I use Hash most of the time and recommend the same.

ray.wurlod · Post by **ray.wurlod** » Thu Nov 02, 2006 7:35 pm

Your scenario will not apply in an integer sequence which is what I specified.

tagnihotri · Post by **tagnihotri** » Mon Nov 06, 2006 7:05 am

I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...

talk2shaanc · Post by **talk2shaanc** » Mon Nov 06, 2006 9:04 am

tagnihotri wrote:I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...

As i mentioned before, modulous partition is totally dependent on number of nodes; number of partition will always be less than or equal to number of nodes. If you see the algorithm for Modulous, you will see that partition is also data dependent and after partitioning the data the distribution of data in each partition may vary.

if you have modulous partition on an integer field having data in "SEQUENCE", then yes I agree modulous would be a good candidate.

But if you are not aware of the distribution of data for the partition key or its random integer value. Then think and do lot of research before deciding on Modulous partioning.

ray.wurlod · Post by **ray.wurlod** » Mon Nov 06, 2006 1:23 pm

Hash partitioning is also totally dependent upon the number of nodes.

talk2shaanc · Post by **talk2shaanc** » Mon Nov 06, 2006 2:10 pm

ray.wurlod wrote:Hash partitioning is also totally dependent upon the number of nodes.

I don't think so.
For Hash partition hash function is used and in hash function, I don't think nodes are used. The number of partition is more data dependent. A generic definition of hash: The result of the hash algorithm, is a number between 0 and 1023, which is applied to the partitioning map to determine the node on which the record reside.

Whereas for modulous the data is assigned to a partition as per below formula:
partition_number =

Numeric field in the input

mod

number of processing nodes on which the partitioner executes

.
So, dependency is on nodes and data. There is a possibility that all the nodes are not utilized.

ray.wurlod · Post by **ray.wurlod** » Mon Nov 06, 2006 10:36 pm

Hash partitioning derives partitition number as:
partition = Mod(hashvalue(key),@NUMPARTITIONS)

hashvalue() is a function that generates an int32 result.

Therefore, hash partitioning is totally dependent on the number of partitions, as I previously stated.

Hashing is nothing more than a mechanism for choosing one from among a finite number of alternatives - in this case the number of available partitions.

tagnihotri · Post by **tagnihotri** » Mon Nov 06, 2006 11:18 pm

Hash paritioning also depends on number of nodes. The algo actualy sets the buckets depending on the number of nodes specified in the configuration file. Same as modulus

talk2shaanc wrote:
tagnihotri wrote:I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...
As i mentioned before, modulous partition is totally dependent on number of nodes; number of partition will always be less than or equal to number of nodes. If you see the algorithm for Modulous, you will see that partition is also data dependent and after partitioning the data the distribution of data in each partition may vary.

if you have modulous partition on an integer field having data in "SEQUENCE", then yes I agree modulous would be a good candidate.

But if you are not aware of the distribution of data for the partition key or its random integer value. Then think and do lot of research before deciding on Modulous partioning.