Hash Partitioning
Moderators: chulett, rschirm, roy
-
- Charter Member
- Posts: 87
- Joined: Sat Mar 18, 2006 11:45 pm
Hash Partitioning
Hi
Suppose we have records like
zip,age,apt
48201,20,2
48201,20,1
77058,60,2
77058,20,2
48201,30,5
85674,35,90
77058,10,60
48201,60,30
If we partition the data using all three keys in order zip,age,apt on two nodes ,does all records with zip 48201 arrive on the same node always.
Thanks
Arif
Suppose we have records like
zip,age,apt
48201,20,2
48201,20,1
77058,60,2
77058,20,2
48201,30,5
85674,35,90
77058,10,60
48201,60,30
If we partition the data using all three keys in order zip,age,apt on two nodes ,does all records with zip 48201 arrive on the same node always.
Thanks
Arif
Yes. But...
Why partition on all three? Or is this just an example? All partitioning is used to do is distribute data evenly across multiple independent worker processes. Hash partitioning means that like values stay together, which introduces some skewing as a 99K out of 100K rows may partition together, causing one node to do almost all of the work. A round-robin partitioning means that all nodes get the same number of rows.
In your example, I couldn't see why you need to partition on all three, seems strange.
Why partition on all three? Or is this just an example? All partitioning is used to do is distribute data evenly across multiple independent worker processes. Hash partitioning means that like values stay together, which introduces some skewing as a 99K out of 100K rows may partition together, causing one node to do almost all of the work. A round-robin partitioning means that all nodes get the same number of rows.
In your example, I couldn't see why you need to partition on all three, seems strange.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
-
- Charter Member
- Posts: 87
- Joined: Sat Mar 18, 2006 11:45 pm
Okkcbland wrote:Yes. But...
Why partition on all three? Or is this just an example? All partitioning is used to do is distribute data evenly across multiple independent worker processes. Hash partitioning means that like values stay together, which introduces some skewing as a 99K out of 100K rows may partition together, causing one node to do almost all of the work. A round-robin partitioning means that all nodes get the same number of rows.
In your example, I couldn't see why you need to partition on all three, seems strange.
Here is my requirement ,I need to perform remove duplicate operation on zip and age but the duplicate which should be retained should have a higher value of apartment number .
I was trying to hash partition and sort on all three keys and remove duplicate on first two keys with to retain option as last.I am not sure if all the records with same zip and age arrive on same partition ,or is there some other way to accomplish it.
Thanks
Arif
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It would be sufficient to hash (or modulus) on zip.
Sorting on the other two columns is beneficial to the correct operation on the Remove Duplicates stage, but all you need to ensure is that each distinct value of zip (the grouping column) appears on exactly one node.
Round robin partitioning will not achieve this, as it is purely based on row number.
Sorting on the other two columns is beneficial to the correct operation on the Remove Duplicates stage, but all you need to ensure is that each distinct value of zip (the grouping column) appears on exactly one node.
Round robin partitioning will not achieve this, as it is purely based on row number.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Not necessarily....If we partition the data using all three keys in order zip,age,apt on two nodes ,does all records with zip 48201 arrive on the same node always.
If you mark all the three column as key for hash partition, the partiton will be done on the accumulation of all the three columns. Hence might have the chance of zip - 48201 to be distributed across two nodes.
Unless the hash partiton only on zip will not ensure your condition.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Charter Member
- Posts: 87
- Joined: Sat Mar 18, 2006 11:45 pm
I tried sorting by apt first using sort stage then perform a remove duplicate by repartitioning the data on zip and age and sorting it,everything works fine but I am unable retain the record with higher apt.ray.wurlod wrote:It would be sufficient to hash (or modulus) on zip.
Sorting on the other two columns is beneficial to the correct operation on the Remove Duplicates stage, but all you need to ensure is that each distinct value of zip (the grouping column) appears on exactly one node.
.
Thanks
Arif
For your requirement, you will need to hash partition on 2 keys but sort on all 3 keys. Use the correct sort order for the third key or use Duplicate to retain = 'last' . After this stage you can use remove duplicates with only the first 2 keys.
If the sort is done only on 2 keys, you will not be able to ensure that the highest value of third key is the one that is retained.
Hope this helps.
If the sort is done only on 2 keys, you will not be able to ensure that the highest value of third key is the one that is retained.
Hope this helps.
Regards,
Bakul
Bakul