Hash Partitioning

mab_arif16 · Post by **mab_arif16** » Tue Jun 27, 2006 5:22 pm

Hi
Suppose we have records like

zip,age,apt
48201,20,2
48201,20,1
77058,60,2
77058,20,2
48201,30,5
85674,35,90
77058,10,60
48201,60,30

If we partition the data on all three keys, in the order zip, age, apt, across two nodes, do all records with zip 48201 always arrive on the same node?

Thanks
Arif

kcbland · Post by **kcbland** » Tue Jun 27, 2006 7:54 pm

Yes. But...

Why partition on all three? Or is this just an example? The only purpose of partitioning is to distribute data evenly across multiple independent worker processes. Hash partitioning keeps like values together, which can introduce skew: if 99K out of 100K rows share the same key values, they partition together and one node does almost all of the work. Round-robin partitioning gives every node the same number of rows.

In your example, I couldn't see why you need to partition on all three, seems strange.
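
To make the skew point concrete, here is a minimal Python sketch (not DataStage code). It assumes a toy modulus partitioner, `key % NODES`, as a stand-in for the engine's real hash function, and a made-up input in which 99 of 100 rows share one zip.

```python
from collections import Counter

NODES = 2

# Hypothetical skewed input: 99 rows share zip 48201, one row has zip 77058.
zips = [48201] * 99 + [77058]

# Keyed partitioning: equal keys always land on the same node.
# (key % NODES is a toy stand-in for a real hash function.)
keyed_counts = Counter(z % NODES for z in zips)

# Round robin: the node depends only on the row's position in the input.
round_robin_counts = Counter(i % NODES for i, _ in enumerate(zips))

print("keyed:      ", dict(keyed_counts))
print("round robin:", dict(round_robin_counts))
```

The keyed split leaves one node with nearly all the rows, while round robin balances the counts but gives no guarantee about where any particular zip ends up.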

mab_arif16 · Post by **mab_arif16** » Tue Jun 27, 2006 11:13 pm

kcbland wrote: Yes. But...

Why partition on all three? Or is this just an example? The only purpose of partitioning is to distribute data evenly across multiple independent worker processes. Hash partitioning keeps like values together, which can introduce skew: if 99K out of 100K rows share the same key values, they partition together and one node does almost all of the work. Round-robin partitioning gives every node the same number of rows.

In your example, I couldn't see why you need to partition on all three, seems strange.

Ok
Here is my requirement: I need to remove duplicates on zip and age, but the record that is retained should be the one with the highest apartment number.
I was trying to hash partition and sort on all three keys, then remove duplicates on the first two keys with the retain option set to last. I am not sure whether all the records with the same zip and age arrive on the same partition, or whether there is some other way to accomplish this.

Thanks
Arif

bchau · Post by **bchau** » Tue Jun 27, 2006 11:37 pm

You could use a Remove Duplicates stage and set it to retain the last row.

I don't have access to EE at the moment, but I believe that a link sort can also be used to remove duplicates if you enable unique and stable sort.

ray.wurlod · Post by **ray.wurlod** » Wed Jun 28, 2006 1:03 am

It would be sufficient to hash (or modulus) on zip.

Sorting on the other two columns is beneficial to the correct operation of the Remove Duplicates stage, but all you need to ensure is that each distinct value of zip (the grouping column) appears on exactly one node.

Round robin partitioning will not achieve this, as it is purely based on row number.

kumar_s · Post by **kumar_s** » Wed Jun 28, 2006 3:45 am

If we partition the data on all three keys, in the order zip, age, apt, across two nodes, do all records with zip 48201 always arrive on the same node?

Not necessarily...
If you mark all three columns as keys for the hash partition, the hash is computed on the combination of all three columns. Hence rows with zip 48201 may be distributed across both nodes.

Only a hash partition on zip alone will ensure your condition.
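
A small Python sketch of this, using the sample rows from the first post. The hash here is a toy one (sum of the key columns, modulus the node count), chosen only so the result is easy to check by hand; it is an assumption and not the hash function the parallel engine actually uses.

```python
NODES = 2

# Sample rows from the original question: (zip, age, apt).
rows = [
    (48201, 20, 2),
    (48201, 20, 1),
    (77058, 60, 2),
    (77058, 20, 2),
    (48201, 30, 5),
    (85674, 35, 90),
    (77058, 10, 60),
    (48201, 60, 30),
]

def node_for(row, key_cols):
    # Toy hash: add up the key columns, then take the modulus.
    return sum(row[c] for c in key_cols) % NODES

def nodes_holding_zip(zip_code, key_cols):
    # Set of nodes that receive at least one row with this zip.
    return {node_for(r, key_cols) for r in rows if r[0] == zip_code}

zip_only = nodes_holding_zip(48201, key_cols=(0,))
all_three = nodes_holding_zip(48201, key_cols=(0, 1, 2))

print("hash on zip only:     ", sorted(zip_only))
print("hash on zip, age, apt:", sorted(all_three))
```

With zip as the only key, every 48201 row maps to one node. With all three columns in the key, the age and apt values change the hash, so the same zip shows up on both nodes.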

mab_arif16 · Post by **mab_arif16** » Wed Jun 28, 2006 8:59 am

ray.wurlod wrote: It would be sufficient to hash (or modulus) on zip.

Sorting on the other two columns is beneficial to the correct operation of the Remove Duplicates stage, but all you need to ensure is that each distinct value of zip (the grouping column) appears on exactly one node.

I tried sorting by apt first using a Sort stage, then removing duplicates after repartitioning the data on zip and age and sorting it. Everything works fine, but I am unable to retain the record with the higher apt.
Thanks
Arif

kumar_s · Post by **kumar_s** » Wed Jun 28, 2006 11:01 pm

Hash partition on zip and age. Mark both as keys in the Sort stage. Set AllowDuplicates = False. Reverse the sort order to get your desired result. You can accomplish this within a single stage.

bakul · Post by **bakul** » Wed Sep 20, 2006 10:51 pm

For your requirement, you will need to hash partition on 2 keys but sort on all 3 keys. Use the correct sort order for the third key, or set Duplicate to retain = 'last'. After this stage you can use Remove Duplicates with only the first 2 keys.

If the sort is done only on 2 keys, you will not be able to ensure that the highest value of third key is the one that is retained.
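
The steps above can be sketched in plain Python against the sample rows from the first post. This is only an illustration of the logic, not DataStage code: the partitioner is a toy `(zip + age) % NODES` chosen as an assumption, and the "retain last" step is simulated with a dictionary that later rows overwrite.

```python
from collections import defaultdict

NODES = 2

# Sample rows from the original question: (zip, age, apt).
rows = [
    (48201, 20, 2),
    (48201, 20, 1),
    (77058, 60, 2),
    (77058, 20, 2),
    (48201, 30, 5),
    (85674, 35, 90),
    (77058, 10, 60),
    (48201, 60, 30),
]

# Step 1: hash partition on the two grouping keys only (toy hash).
partitions = defaultdict(list)
for zip_code, age, apt in rows:
    partitions[(zip_code + age) % NODES].append((zip_code, age, apt))

result = []
for node in sorted(partitions):
    # Step 2: sort within the partition on all three keys (apt ascending).
    ordered = sorted(partitions[node])
    # Step 3: remove duplicates on (zip, age), retaining the last row,
    # which after the sort is the one with the highest apt.
    last_seen = {}
    for zip_code, age, apt in ordered:
        last_seen[(zip_code, age)] = apt
    result.extend((z, a, apt) for (z, a), apt in last_seen.items())

result.sort()
for row in result:
    print(row)
```

Because every row with the same zip and age goes to the same partition, the per-partition dedupe sees the whole group, and the three-key sort puts the highest apt last in each group.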

Hope this helps.