Page 1 of 1

Partitioning for Aggregator

Posted: Thu Aug 23, 2012 1:22 am
by ThilSe
Hi,

When using the aggregator in sort option the input record must be sorted using ALL the aggregate key columns, but, is it mandatory to partition it based on ALL the key columns (as documentation specifies) - isn't it enough if it is partitioned based on just one of the Key column?

Thanks and Regards,
Senthil

Re: Partitioning for Aggregator

Posted: Thu Aug 23, 2012 2:04 am
by akarsh
I think we need to use the key on which we are going to aggregate.if its a single key than it should work.

Posted: Thu Aug 23, 2012 2:12 am
by ArndW
When you have multiple keys it is sufficient to partition on just the first of those keys.

Posted: Thu Aug 23, 2012 4:53 am
by ray.wurlod
It usually suffices to partition on just the first grouping (sorting) key. If this has only a few values in its domain, add the second grouping (sorting) key to the partitioning algorithm too.

Re: Partitioning for Aggregator

Posted: Fri Aug 24, 2012 10:10 am
by ThilSe
Hi - Thanks for your comments.

Ray/ArndW - Is it mandatory to pick only the 'first' key in sort as partition key? Will there be any issues if the data is partitioned on the fourth or fifth key?

The reason I am asking this is, I am dealing with huge input datasets (>500million records) that is partitioned on one key (think of it as account #/customer # that provides good distribution) and sorted on some six keys (with partition key column fifth in sort order).

I am doing aggregation multiple times on this dataset. I want to avoid repartitioning and sorting of this dataset again unless it is absolutely necessay. Thats why I thought I will get clarity on this.

Regards,
Senthil

Posted: Fri Aug 24, 2012 4:43 pm
by ray.wurlod
Nothing is mandatory. But think about what will happen if you don't partition on the first sort key value. The same value of the first sort key may occur on more than one partition which means you will end up with more groups than there really are.