Hash Partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
mab_arif16
Charter Member
Charter Member
Posts: 87
Joined: Sat Mar 18, 2006 11:45 pm

Hash Partitioning

Post by mab_arif16 »

Hi
Suppose we have records like

zip,age,apt
48201,20,2
48201,20,1
77058,60,2
77058,20,2
48201,30,5
85674,35,90
77058,10,60
48201,60,30

If we partition the data using all three keys in order zip,age,apt on two nodes ,does all records with zip 48201 arrive on the same node always.

Thanks
Arif
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Yes. But...


Why partition on all three? Or is this just an example? All partitioning is used to do is distribute data evenly across multiple independent worker processes. Hash partitioning means that like values stay together, which introduces some skewing as a 99K out of 100K rows may partition together, causing one node to do almost all of the work. A round-robin partitioning means that all nodes get the same number of rows.

In your example, I couldn't see why you need to partition on all three, seems strange.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
mab_arif16
Charter Member
Charter Member
Posts: 87
Joined: Sat Mar 18, 2006 11:45 pm

Post by mab_arif16 »

kcbland wrote:Yes. But...


Why partition on all three? Or is this just an example? All partitioning is used to do is distribute data evenly across multiple independent worker processes. Hash partitioning means that like values stay together, which introduces some skewing as a 99K out of 100K rows may partition together, causing one node to do almost all of the work. A round-robin partitioning means that all nodes get the same number of rows.

In your example, I couldn't see why you need to partition on all three, seems strange.
Ok
Here is my requirement ,I need to perform remove duplicate operation on zip and age but the duplicate which should be retained should have a higher value of apartment number .
I was trying to hash partition and sort on all three keys and remove duplicate on first two keys with to retain option as last.I am not sure if all the records with same zip and age arrive on same partition ,or is there some other way to accomplish it.

Thanks
Arif
bchau
Charter Member
Charter Member
Posts: 46
Joined: Tue Jan 18, 2005 7:39 am
Location: Japan
Contact:

Post by bchau »

You could use a Remove Duplicates stage and set it to retain the last row.

I don't have access to EE atm but I do believe that link sort can also be used to remove duplicates if you enable unique and stable sort.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It would be sufficient to hash (or modulus) on zip.

Sorting on the other two columns is beneficial to the correct operation on the Remove Duplicates stage, but all you need to ensure is that each distinct value of zip (the grouping column) appears on exactly one node.

Round robin partitioning will not achieve this, as it is purely based on row number.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If we partition the data using all three keys in order zip,age,apt on two nodes ,does all records with zip 48201 arrive on the same node always.
Not necessarily....
If you mark all the three column as key for hash partition, the partiton will be done on the accumulation of all the three columns. Hence might have the chance of zip - 48201 to be distributed across two nodes.

Unless the hash partiton only on zip will not ensure your condition.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
mab_arif16
Charter Member
Charter Member
Posts: 87
Joined: Sat Mar 18, 2006 11:45 pm

Post by mab_arif16 »

ray.wurlod wrote:It would be sufficient to hash (or modulus) on zip.

Sorting on the other two columns is beneficial to the correct operation on the Remove Duplicates stage, but all you need to ensure is that each distinct value of zip (the grouping column) appears on exactly one node.
.
I tried sorting by apt first using sort stage then perform a remove duplicate by repartitioning the data on zip and age and sorting it,everything works fine but I am unable retain the record with higher apt.
Thanks
Arif
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hash partition on Zip and Age. Mark the both as key in sort stage. Mark AllowDuplicates = False. Reverse the order of sort for your desired result. You can accomplish within single stage.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
bakul
Participant
Posts: 60
Joined: Wed Nov 10, 2004 2:12 am

Post by bakul »

For your requirement, you will need to hash partition on 2 keys but sort on all 3 keys. Use the correct sort order for the third key or use Duplicate to retain = 'last' . After this stage you can use remove duplicates with only the first 2 keys.

If the sort is done only on 2 keys, you will not be able to ensure that the highest value of third key is the one that is retained.

Hope this helps.
Regards,
Bakul
Post Reply