Partitioning for Aggregator

ThilSe
Participant
Posts: 80
Joined: Thu Jun 09, 2005 7:45 am

Partitioning for Aggregator

Post by ThilSe »

Hi,

When using the Aggregator with the sort method, the input records must be sorted on ALL the aggregation key columns. But is it also mandatory to partition on ALL the key columns, as the documentation specifies? Isn't it enough to partition on just one of the key columns?

Thanks and Regards,
Senthil
akarsh
Participant
Posts: 51
Joined: Fri May 09, 2008 4:03 am
Location: Pune

Re: Partitioning for Aggregator

Post by akarsh »

I think we need to use the key on which we are going to aggregate. If it's a single key, then it should work.
Thanks,
Akarsh Kapoor
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

When you have multiple keys it is sufficient to partition on just the first of those keys.
Last edited by ArndW on Thu Aug 23, 2012 5:19 am, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It usually suffices to partition on just the first grouping (sorting) key. If this has only a few values in its domain, add the second grouping (sorting) key to the partitioning algorithm too.
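To illustrate why a key with only a few values is a poor partitioning key on its own, here is a minimal Python sketch. It is not DataStage code: the column names `region` and `account`, the row counts, and the CRC-based hash are all assumptions made up for the example, standing in for a hash partitioner on an 8-node configuration.

```python
import zlib
from collections import Counter

def partition_of(row, keys, n_partitions):
    """Assign a row to a partition by hashing the chosen key columns."""
    token = "|".join(str(row[k]) for k in keys).encode()
    return zlib.crc32(token) % n_partitions

def rows_per_partition(rows, keys, n_partitions):
    """Return the number of rows that land on each partition."""
    counts = Counter(partition_of(r, keys, n_partitions) for r in rows)
    return [counts.get(p, 0) for p in range(n_partitions)]

# Hypothetical data: 'region' has 2 distinct values, 'account' has 1000.
rows = [
    {"region": "EAST" if i % 2 else "WEST", "account": i % 1000}
    for i in range(10_000)
]

# Partition on the first grouping key only, then on the first two keys.
only_first = rows_per_partition(rows, ["region"], 8)
first_and_second = rows_per_partition(rows, ["region", "account"], 8)

print("partitioned on region only:       ", only_first)
print("partitioned on region and account:", first_and_second)
```

With only two distinct `region` values, at most two of the eight partitions can receive any rows, so the rest sit idle. Adding `account` to the hash spreads the rows across more partitions while still keeping every (region, account) group together.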
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ThilSe
Participant
Posts: 80
Joined: Thu Jun 09, 2005 7:45 am

Re: Partitioning for Aggregator

Post by ThilSe »

Hi - Thanks for your comments.

Ray/ArndW - Is it mandatory to pick only the 'first' key in sort as partition key? Will there be any issues if the data is partitioned on the fourth or fifth key?

The reason I am asking is that I am dealing with huge input datasets (>500 million records) that are partitioned on one key (think of it as an account #/customer # that provides good distribution) and sorted on some six keys (with the partition key column fifth in the sort order).

I am doing aggregation multiple times on this dataset. I want to avoid repartitioning and re-sorting this dataset unless it is absolutely necessary. That's why I wanted to get clarity on this.

Regards,
Senthil
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Nothing is mandatory. But think about what will happen if you don't partition on the first sort key. The same value of the first sort key may occur on more than one partition, which means you will end up with more groups than there really are.
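The effect can be reproduced with a small Python simulation. This is a sketch under stated assumptions, not DataStage itself: the columns `branch` (standing in for the first sort key) and `account` (standing in for the later key the data is partitioned on), the 4-way hash, and the row counts are all hypothetical.

```python
import zlib
from collections import defaultdict

def partition_of(row, keys, n_partitions):
    """Assign a row to a partition by hashing the chosen key columns."""
    token = "|".join(str(row[k]) for k in keys).encode()
    return zlib.crc32(token) % n_partitions

def aggregate_per_partition(rows, partition_keys, group_keys, n_partitions):
    """Hash-partition the rows, then count rows per group separately
    inside each partition, as each parallel aggregator instance would."""
    partitions = defaultdict(list)
    for r in rows:
        partitions[partition_of(r, partition_keys, n_partitions)].append(r)
    output = []
    for part_rows in partitions.values():
        groups = defaultdict(int)
        for r in part_rows:
            groups[tuple(r[k] for k in group_keys)] += 1
        output.extend(groups.items())
    return output

# Hypothetical data: 2 branches, 20 accounts in each branch.
rows = [{"branch": b, "account": a} for b in ("B1", "B2") for a in range(20)]

# Grouping on branch only, but partitioned on account: groups get split.
split = aggregate_per_partition(rows, ["account"], ["branch"], 4)
# Grouping on branch and account, partitioned on account: groups stay whole.
whole = aggregate_per_partition(rows, ["account"], ["branch", "account"], 4)
# Grouping on branch, partitioned on branch: groups stay whole.
first = aggregate_per_partition(rows, ["branch"], ["branch"], 4)

print(len(split), "output rows for 2 real groups (partitioned on account)")
print(len(whole), "output rows for 40 real groups (partitioned on account)")
print(len(first), "output rows for 2 real groups (partitioned on branch)")
```

In this simulation the extra groups appear only when the grouping keys leave out the partitioning column, because rows of one group are then hashed to different partitions. When the partitioning column is among the grouping keys of a given aggregation, every group stays on a single partition.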