Range Partitioning Vs Hash

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dstest
Participant
Posts: 66
Joined: Sun Aug 19, 2007 10:52 pm

Range Partitioning Vs Hash

Post by dstest »

I am processing 1 million records on 4 nodes. The parallel job has a Join stage, so I am hash-partitioning on 4 key columns. The data is not distributing evenly across the four nodes because some groups have many records and others have very few.

In this case I used range partitioning and it distributes the records evenly across all nodes.

Is there any disadvantage to using range partitioning? Can anyone please tell me the advantages and disadvantages of range partitioning over hash in this scenario? Which is the best partitioning method for my case?

Thanks
dstest
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The only disadvantage is the need to preprocess your data to write the range map used by the partitioning algorithm. Along with this goes the need for a standard naming convention for your range maps, so that the correct range map is associated with a particular set of data and the range map for one job is not being destroyed by a concurrently running job.

There are special settings for Funnel and Collectors that work best with range-partitioned data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dstest
Participant
Posts: 66
Joined: Sun Aug 19, 2007 10:52 pm

Post by dstest »

Do we need to process all the data, or is it sufficient to take sample records and create the range map from those?

Once we create a range map, can we use the same one in test, stage and prod? Also, on an ongoing basis, do we need to run this process every time before running the main job?

Thanks
dstest
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Safest is to pre-process all the data every time. Think about it:
  • You don't process the same data in production that you do in development.
  • You rarely re-process the same set of data.
  • If the data happen to be sorted and your sample is "the first n% of rows", your range map will be badly wrong.
  • If your sample is of the form "random n% of rows", you are processing all the rows anyway.
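The sorted-data point can be seen in a tiny sketch (plain Python; the key values are made up): building a range map from the first 10% of rows of an already-sorted key puts every boundary at the low end of the values, so nearly everything falls in the last partition.

```python
import bisect

NODES = 4

# Hypothetical key column that happens to arrive already sorted.
data = list(range(1000))

# Range map built from "the first 10% of rows" of that sorted input:
sample = data[:100]
boundaries = [sample[len(sample) * i // NODES] for i in range(1, NODES)]
# All three boundaries sit inside the lowest 10% of the key values.

loads = [0] * NODES
for key in data:
    loads[bisect.bisect_right(boundaries, key)] += 1
print(loads)  # prints [25, 25, 25, 925]
```

Three nodes get 25 rows each and the fourth gets 925, exactly the "badly wrong" range map described above.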
dstest
Participant
Posts: 66
Joined: Sun Aug 19, 2007 10:52 pm

Post by dstest »

Thanks for your valuable suggestions.