Hash Partitioning on columns with same values

Madhu1981
Participant
Posts: 69
Joined: Wed Feb 22, 2006 7:49 am

Post by Madhu1981 »

Hi All,

I have a configuration file with 4 nodes. I have a job where I need to hash partition on one column (call it A). A million records come from the source, and every record has the same value in column A.

When I perform hash partitioning, will the data be spread across all 4 nodes, or will it all go to one node?
Please clarify.

Thanks in advance.
thumsup9
Charter Member
Posts: 168
Joined: Fri Feb 18, 2005 11:29 am

Post by thumsup9 »

This is copied straight from the dsx PDF:

Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found.
Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes.
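The skew described in that passage can be illustrated with a small simulation. This is only a sketch: `zlib.crc32` stands in for the hash function (DataStage's actual hash is not shown in this thread), and the zip codes and the 90% share are made-up sample data.

```python
import zlib
from collections import Counter


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition number.

    crc32 is an assumed stand-in, not DataStage's real hash function.
    It is used here because it is deterministic across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(hash_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed input: 900 of 1000 rows share one zip code.
zip_codes = ["33549"] * 900 + [str(10000 + i) for i in range(100)]

sizes = partition_sizes(zip_codes, 4)
print(sizes)  # one partition holds at least 900 rows
```

Whatever the hash function is, the 900 rows with the same zip code must land together, so one partition ends up doing most of the work.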
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

All rows go to one node. Hash means same values stay together on a node.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The reason is that every row has the same value in A, so every row generates the same hash value and is sent to the same partition.

It's the same as in SQL - if you group by a column that contains only one distinct value you will end up with one group.

Prefer Round Robin or Random, or partition on a different key.
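A rough sketch of the difference, scaled down to 10,000 rows for speed. The function names are illustrative only, and `zlib.crc32` is an assumed stand-in for whatever hash the engine really uses.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Send each row to the partition chosen by hashing its key column."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is a stand-in hash, not the engine's actual function.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        parts[idx].append(row)
    return parts


def round_robin_partition(rows, num_partitions):
    """Deal rows out to the partitions in turn, ignoring their contents."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


# Every row carries the same value in column A, as in the question.
rows = [{"A": "X", "id": i} for i in range(10_000)]

hash_sizes = [len(p) for p in hash_partition(rows, "A", 4)]
rr_sizes = [len(p) for p in round_robin_partition(rows, 4)]

print("hash:       ", hash_sizes)  # all rows in a single partition
print("round robin:", rr_sizes)    # 2500 rows in each partition
```

With a single distinct key value, three of the four partitions stay empty under hash partitioning, while round robin spreads the rows evenly.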
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You need to choose the partitioning method based on your requirement. If you need a grouping function such as aggregation (for example, a count of records), you have to use a key-based partitioning method (hash); otherwise you can proceed with what has been suggested.
For grouping in your case, you could even run in sequential mode :wink:
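Here is a small sketch of why a grouping operation wants key-based partitioning. Each partition counts its own rows independently, much as a parallel aggregation would. The helper names and sample keys are made up for illustration, and `zlib.crc32` is again only a stand-in hash.

```python
import zlib
from collections import Counter


def hash_parts(keys, num_partitions):
    """Key-based partitioning: equal keys always share a partition."""
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        # crc32 is an assumed stand-in for the engine's hash function.
        parts[zlib.crc32(k.encode("utf-8")) % num_partitions].append(k)
    return parts


def round_robin_parts(keys, num_partitions):
    """Keyless partitioning: rows are dealt out in turn."""
    parts = [[] for _ in range(num_partitions)]
    for i, k in enumerate(keys):
        parts[i % num_partitions].append(k)
    return parts


def local_counts(parts):
    """Count rows per key inside each partition, with no cross-talk."""
    return [Counter(p) for p in parts]


keys = ["A"] * 8 + ["B"] * 4

hash_result = local_counts(hash_parts(keys, 4))
rr_result = local_counts(round_robin_parts(keys, 4))

# Number of partitions that each report their own count for key "A".
groups_for_a_hash = sum(1 for c in hash_result if "A" in c)
groups_for_a_rr = sum(1 for c in rr_result if "A" in c)

print(groups_for_a_hash)  # 1: a single, complete count for "A"
print(groups_for_a_rr)    # 4: four partial counts that still need merging
```

Under round robin, the partial counts for "A" still add up to 8, but only after a second merge step; with hash partitioning each key's count is already complete within one partition.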
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'