Range Partitioning Vs Hash

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dstest
Participant
Posts: 66
Joined: Sun Aug 19, 2007 10:52 pm

Range Partitioning Vs Hash

Post by dstest »

I am processing 1 million records on 4 nodes. The parallel job has a Join stage, so I am hash-partitioning on 4 key columns. The data is not distributing evenly across the four nodes because some groups have many records and others have very few.

In this case I used range partitioning and it distributes the records evenly across all nodes.

Is there any disadvantage to using range partitioning? Can anyone please tell me the advantages and disadvantages of range partitioning over hash in this scenario? Which is the best partitioning method for my case?

Thanks
dstest
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The only disadvantage is the need to preprocess your data to write the range map used by the partitioning algorithm. Along with this goes the need for a standard naming convention for your range maps, so that the correct range map is associated with a particular set of data and the range map for one job is not being destroyed by a concurrently running job.

There are special settings for Funnel and Collectors that work best with range-partitioned data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dstest
Participant
Posts: 66
Joined: Sun Aug 19, 2007 10:52 pm

Post by dstest »

Do we need to process all the data, or is it sufficient to take sample records and create the range map from those?

Once we create a range map, can we use the same one in test, stage and prod? Also, on an ongoing basis, do we need to run this process every time before running the main job?

Thanks
dstest
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Safest is to pre-process all the data every time. Think about it:
  • You don't process the same data in production that you do in development.
  • You rarely re-process the same set of data.
  • If the data happen to be sorted and your sample is "the first n% of rows", your range map will be badly wrong.
  • If your sample is of the form "random n% of rows", you are processing all the rows anyway.
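The sorted-data point can be seen in a tiny sketch (plain Python; the key values are made up): building a range map from the first 10% of rows of an already-sorted key puts every boundary at the low end of the values, so nearly everything falls in the last partition.

```python
import bisect

NODES = 4

# Hypothetical key column that happens to arrive already sorted.
data = list(range(1000))

# Range map built from "the first 10% of rows" of that sorted input:
sample = data[:100]
boundaries = [sample[len(sample) * i // NODES] for i in range(1, NODES)]
# All three boundaries sit inside the lowest 10% of the key values.

loads = [0] * NODES
for key in data:
    loads[bisect.bisect_right(boundaries, key)] += 1
print(loads)  # prints [25, 25, 25, 925]
```

Three nodes get 25 rows each and the fourth gets 925, exactly the "badly wrong" range map described above.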
dstest
Participant
Posts: 66
Joined: Sun Aug 19, 2007 10:52 pm

Post by dstest »

Thanks for your valuable suggestions.