methods for aggregator stage

Post by **cetzhbo** » Tue May 27, 2008 4:29 am

Hello Gurus,

In the Aggregator stage, there are two methods:

The "hash" method requires hash partitioning on the grouping keys.
The "sort" method also requires hash partitioning of the input on the
grouping keys.

What's the difference between these two methods?

Thanks very much!

Post by **ray.wurlod** » Tue May 27, 2008 6:44 am

The difference is how memory is managed.

The HASH method builds a hash table in memory with one row for each combination of grouping values. It cannot generate any output rows until all rows have entered the Aggregator stage.
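To illustrate the idea outside DataStage, here is a minimal Python sketch of hash-style aggregation. It is not how the Aggregator stage is implemented; the function name `hash_aggregate` and the sample columns `region` and `amount` are made up for the example.

```python
from collections import defaultdict


def hash_aggregate(rows, key_cols, value_col):
    """Sum value_col per combination of key_cols (hypothetical helper)."""
    # One table entry per distinct combination of grouping values;
    # the whole table stays in memory until the input is exhausted.
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        totals[key] += row[value_col]
    # Only now, after the last input row, can any result be emitted.
    return dict(totals)


# Input order does not matter for the hash approach.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
hash_result = hash_aggregate(rows, ["region"], "amount")
```

Memory grows with the number of distinct groups, which is why this approach suits inputs with relatively few groups.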

The SORT method does the same, but it flushes and frees that memory whenever any of the sorted grouping columns changes value. It can do that because the input is sorted on those columns, so we know that the previous value will never be seen again. Overall, this uses far less memory than the HASH method for aggregation.
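As a rough illustration of the sort approach (again plain Python, not DataStage internals), the sketch below keeps only the current group in memory and emits it as soon as the key changes. The name `sort_aggregate` and the sample columns are assumptions made for the example, and the input is assumed to be already sorted on the grouping keys.

```python
def sort_aggregate(sorted_rows, key_cols, value_col):
    """Yield (key, sum) pairs from rows pre-sorted on key_cols (hypothetical helper)."""
    current_key = None
    total = 0
    for row in sorted_rows:
        key = tuple(row[c] for c in key_cols)
        if current_key is not None and key != current_key:
            # Key changed: the previous group can never reappear in
            # sorted input, so emit it and release its state.
            yield current_key, total
            total = 0
        current_key = key
        total += row[value_col]
    if current_key is not None:
        # Flush the final group once the input ends.
        yield current_key, total


# Rows must arrive sorted on the grouping column(s).
sorted_rows = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": 5},
]
sort_result = list(sort_aggregate(sorted_rows, ["region"], "amount"))
```

Only one running total is held at a time, regardless of how many groups the input contains; the price is that the data has to be sorted first.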