suggest which typr of partition i use in PX job..
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 1
- Joined: Wed Oct 25, 2006 7:05 pm
suggest which typr of partition i use in PX job..
Hello friend..
I have job in which i join 2 tables in PX job..
which partition type i use..
thanks,...
I have job in which i join 2 tables in PX job..
which partition type i use..
thanks,...
-
- Charter Member
- Posts: 199
- Joined: Tue Jan 18, 2005 2:50 am
- Location: India
Hash partition works best while performing a join as its key based...which makes it efficient, accurate and fast...
I would suggest you read below thread, it's worth looking for if you want to know more about joins..
viewtopic.php?t=104814
I would suggest you read below thread, it's worth looking for if you want to know more about joins..
viewtopic.php?t=104814
Shantanu Choudhary
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
If the keys are integer sequences, then Modulus becomes a candidate for most efficient partitioning algorithm.
If the two tables are in the same instance of the same database, a join performed in the database will almost certainly be more efficient than using any of the DataStage stages - the latter approach has to import all rows from both tables (subject to any other WHERE criteria) into virtual Data Sets before performing the combination of rows; if you do it in the database only the result of the join needs to be processed through DataStage.
If the two tables are in the same instance of the same database, a join performed in the database will almost certainly be more efficient than using any of the DataStage stages - the latter approach has to import all rows from both tables (subject to any other WHERE criteria) into virtual Data Sets before performing the combination of rows; if you do it in the database only the result of the join needs to be processed through DataStage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 397
- Joined: Wed Apr 12, 2006 2:28 pm
- Location: Tennesse
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Charter Member
- Posts: 199
- Joined: Tue Jan 18, 2005 2:50 am
- Location: India
Modulous may be using simpler method of calculation and may not be taking too many resource..But number of partition is totally dependent on number of nodes, number of partition will always be less than or equal to number of nodes.
assuming you have two nodes and in the incoming record 90% of the record are even number. Then you have one partition with 90% record and second partition with only 10%.
Even hash has similar problem. If you have 1 field as key and 90% record has same value all of them will be written to same partition. In such situation having multiple key is advised. The situation is rare, but it do happen.
That is why, I use Hash most of the time and recommend the same.
assuming you have two nodes and in the incoming record 90% of the record are even number. Then you have one partition with 90% record and second partition with only 10%.
Even hash has similar problem. If you have 1 field as key and 90% record has same value all of them will be written to same partition. In such situation having multiple key is advised. The situation is rare, but it do happen.
That is why, I use Hash most of the time and recommend the same.
Shantanu Choudhary
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Participant
- Posts: 83
- Joined: Sat Oct 28, 2006 6:25 am
-
- Charter Member
- Posts: 199
- Joined: Tue Jan 18, 2005 2:50 am
- Location: India
As i mentioned before, modulous partition is totally dependent on number of nodes; number of partition will always be less than or equal to number of nodes. If you see the algorithm for Modulous, you will see that partition is also data dependent and after partitioning the data the distribution of data in each partition may vary.tagnihotri wrote:I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...
if you have modulous partition on an integer field having data in "SEQUENCE", then yes I agree modulous would be a good candidate.
But if you are not aware of the distribution of data for the partition key or its random integer value. Then think and do lot of research before deciding on Modulous partioning.
Shantanu Choudhary
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Charter Member
- Posts: 199
- Joined: Tue Jan 18, 2005 2:50 am
- Location: India
I don't think so.ray.wurlod wrote:Hash partitioning is also totally dependent upon the number of nodes.
For Hash partition hash function is used and in hash function, I don't think nodes are used. The number of partition is more data dependent. A generic definition of hash: The result of the hash algorithm, is a number between 0 and 1023, which is applied to the partitioning map to determine the node on which the record reside.
Whereas for modulous the data is assigned to a partition as per below formula:
partition_number =
modNumeric field in the input
.number of processing nodes on which the partitioner executes
So, dependency is on nodes and data. There is a possibility that all the nodes are not utilized.
Shantanu Choudhary
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Hash partitioning derives partitition number as:
partition = Mod(hashvalue(key),@NUMPARTITIONS)
hashvalue() is a function that generates an int32 result.
Therefore, hash partitioning is totally dependent on the number of partitions, as I previously stated.
Hashing is nothing more than a mechanism for choosing one from among a finite number of alternatives - in this case the number of available partitions.
partition = Mod(hashvalue(key),@NUMPARTITIONS)
hashvalue() is a function that generates an int32 result.
Therefore, hash partitioning is totally dependent on the number of partitions, as I previously stated.
Hashing is nothing more than a mechanism for choosing one from among a finite number of alternatives - in this case the number of available partitions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 83
- Joined: Sat Oct 28, 2006 6:25 am
Hash paritioning also depends on number of nodes. The algo actualy sets the buckets depending on the number of nodes specified in the configuration file. Same as modulus
talk2shaanc wrote:As i mentioned before, modulous partition is totally dependent on number of nodes; number of partition will always be less than or equal to number of nodes. If you see the algorithm for Modulous, you will see that partition is also data dependent and after partitioning the data the distribution of data in each partition may vary.tagnihotri wrote:I agree with Ray on this one! from what I have seen Modulus works great if used appropriately!! but if you want to play safe--- hash partition both streams ...
if you have modulous partition on an integer field having data in "SEQUENCE", then yes I agree modulous would be a good candidate.
But if you are not aware of the distribution of data for the partition key or its random integer value. Then think and do lot of research before deciding on Modulous partioning.