Partitioning for Aggregator

ThilSe · Post by **ThilSe** » Thu Aug 23, 2012 1:22 am

Hi,

When using the aggregator in sort option the input record must be sorted using ALL the aggregate key columns, but, is it mandatory to partition it based on ALL the key columns (as documentation specifies) - isn't it enough if it is partitioned based on just one of the Key column?

Thanks and Regards,
Senthil

akarsh · Post by **akarsh** » Thu Aug 23, 2012 2:04 am

I think we need to use the key on which we are going to aggregate.if its a single key than it should work.

ArndW · Post by **ArndW** » Thu Aug 23, 2012 2:12 am

When you have multiple keys it is sufficient to partition on just the first of those keys.

ray.wurlod · Post by **ray.wurlod** » Thu Aug 23, 2012 4:53 am

It usually suffices to partition on just the first grouping (sorting) key. If this has only a few values in its domain, add the second grouping (sorting) key to the partitioning algorithm too.

ThilSe · Post by **ThilSe** » Fri Aug 24, 2012 10:10 am

Hi - Thanks for your comments.

Ray/ArndW - Is it mandatory to pick only the 'first' key in sort as partition key? Will there be any issues if the data is partitioned on the fourth or fifth key?

The reason I am asking this is, I am dealing with huge input datasets (>500million records) that is partitioned on one key (think of it as account #/customer # that provides good distribution) and sorted on some six keys (with partition key column fifth in sort order).

I am doing aggregation multiple times on this dataset. I want to avoid repartitioning and sorting of this dataset again unless it is absolutely necessay. Thats why I thought I will get clarity on this.

Regards,
Senthil

ray.wurlod · Post by **ray.wurlod** » Fri Aug 24, 2012 4:43 pm

Nothing is mandatory. But think about what will happen if you don't partition on the first sort key value. The same value of the first sort key may occur on more than one partition which means you will end up with more groups than there really are.

DSXchange

Partitioning for Aggregator

Partitioning for Aggregator

Re: Partitioning for Aggregator

Re: Partitioning for Aggregator