Aggregator Partitioning

kpsita · Post by **kpsita** » Wed Oct 27, 2010 1:59 pm

Hi,

I have a question regarding the Aggregator stage. I am using the Aggregator stage in most of my jobs and the output looks good. But is it mandatory to hash partition and sort by the grouping keys in the Aggregator stage? Currently the partitioning is left at the default, Auto.

Thanks

kwwilliams · Post by **kwwilliams** » Wed Oct 27, 2010 2:13 pm

I think the question should really be: how does Auto partitioning work in a job with an Aggregator?

First, I would ask you to look at the dump score for your job to see what type of partitioning is occurring in the job and where. This will answer your question for you.

Mandatory? No, it is not necessary to insert a hash partition in every case. However, in some cases it would be necessary. Sorting depends on the aggregator method used:

"Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory. Sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys."
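To see why partitioning on the grouping keys matters at all, here is a minimal sketch in plain Python (not DataStage code). It assumes each partition aggregates independently, and the helper names and the `cust`/`amt` columns are made up for illustration. `crc32` is only a stand-in for whatever hash function the engine actually uses.

```python
from collections import defaultdict
from zlib import crc32


def partition_round_robin(rows, n):
    """Deal rows out to n partitions in turn, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def partition_hash(rows, n, key):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is an illustrative stand-in, not the engine's real hash.
        parts[crc32(str(row[key]).encode("utf-8")) % n].append(row)
    return parts


def aggregate_per_partition(parts, key, value):
    """Sum `value` per `key` inside each partition, independently."""
    out = []
    for part in parts:
        totals = defaultdict(int)
        for row in part:
            totals[row[key]] += row[value]
        out.extend(sorted(totals.items()))
    return out


rows = [{"cust": c, "amt": a} for c, a in
        [("A", 10), ("B", 5), ("A", 7), ("C", 1), ("B", 2), ("A", 3)]]

# Round robin splits group "A" and "B" across partitions: partial sums.
rr = aggregate_per_partition(partition_round_robin(rows, 2), "cust", "amt")

# Key-based partitioning keeps each group whole: one row per group.
hs = aggregate_per_partition(partition_hash(rows, 2, "cust"), "cust", "amt")

print(rr)
print(sorted(hs))
```

With key-agnostic partitioning the same group shows up once per partition with a partial total, which is the situation the dump score helps you rule out.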

soumya5891 · Post by **soumya5891** » Sat Mar 12, 2011 12:36 pm

It is better to use hash partitioning whenever you are working on groups of data, as in the Aggregator, Sort, and Remove Duplicates stages.

ray.wurlod · Post by **ray.wurlod** » Sat Mar 12, 2011 2:16 pm

soumya5891 wrote: It is better to use hash partitioning whenever you are working on groups of data, as in the Aggregator, Sort, and Remove Duplicates stages.

That's not always true. For example, if the grouping key is an integer of some kind, then Modulus should be preferred, as it's more efficient than Hash.
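As a rough illustration of the difference, here is a small Python sketch (not DataStage code). The function names are hypothetical, and `crc32` is only a stand-in for the engine's real hash. Both schemes send equal keys to the same partition, which is all a grouping stage needs, but modulus gets there with a single integer operation.

```python
from zlib import crc32


def modulus_partition(key: int, n: int) -> int:
    """Partition number is simply the integer key modulo the partition count."""
    return key % n


def hash_partition(key, n: int) -> int:
    """Partition number from a hash of the key (crc32 as a stand-in)."""
    return crc32(str(key).encode("utf-8")) % n


keys = [101, 102, 103, 101, 204, 102]
n_partitions = 4

mod_assign = [modulus_partition(k, n_partitions) for k in keys]
hash_assign = [hash_partition(k, n_partitions) for k in keys]

print(mod_assign)  # [1, 2, 3, 1, 0, 2]
print(hash_assign)
```

Modulus only applies to integer keys, so Hash remains the general-purpose choice for strings or multi-column keys.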

DSXchange
