sorting in the Aggregator

Posted: Mon Oct 12, 2015 7:47 pm
by wuruima
Dear friends,

When using this stage, must we sort the data on the input link? That is, do we have to tick the 'perform sort' check box, and if we don't tick it, is it possible that the stage will output an incorrect result? Please kindly advise, thanks so much.

walter/

Re: sorting in the Aggregator

Posted: Mon Oct 12, 2015 9:00 pm
by naveenkumar.ssn
Hi,

Whether it matters if you sort or not depends on which aggregate function you are using. However, it is better to give the Aggregator sorted input, because that performs better.

Thanks & Regards
Naveen

Posted: Mon Oct 12, 2015 10:22 pm
by ray.wurlod
You have to use sorted data if the aggregation mode is Sort. The Aggregator makes use of the fact that the data are sorted by the grouping keys to minimize the amount of memory it needs - it only needs to keep one key value in memory.
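
To illustrate why Sort mode depends on sorted input, here is a minimal Python analogy. It is not DataStage code; the function name `sort_mode_sum` and the sample rows are made up for illustration. It keeps only the current key and its running total, so a key that reappears later in unsorted input is treated as a new group.

```python
from itertools import groupby
from operator import itemgetter

def sort_mode_sum(rows):
    """Sum the value per key, assuming rows arrive sorted by key.

    Only the current group's key and running total are held at a time.
    (Hypothetical helper, a sketch of the idea rather than the stage itself.)
    """
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

sorted_rows = [("a", 1), ("a", 2), ("b", 5)]
unsorted_rows = [("a", 1), ("b", 5), ("a", 2)]

sorted_result = list(sort_mode_sum(sorted_rows))      # one row per key
unsorted_result = list(sort_mode_sum(unsorted_rows))  # key "a" appears twice
```

With the unsorted rows, key "a" comes out as two separate groups instead of one total, which is the kind of incorrect result the original question asks about.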

Hash mode means that the Aggregator has to keep a table in memory with a row for every distinct value of the grouping keys. If you estimate the size of this table at 1K per row, this will give you some feel for the amount of memory that would be required.
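
As a rough sketch of the same idea in Python (an analogy only, not DataStage code; `hash_mode_sum`, `estimate_hash_table_bytes` and the figure of two million groups are assumptions for illustration), a hash-style aggregation holds one entry per distinct key, and the 1K-per-row rule of thumb turns into simple arithmetic:

```python
def hash_mode_sum(rows):
    """Sum the value per key; input order does not matter.

    The dictionary grows by one entry for every distinct key.
    """
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

def estimate_hash_table_bytes(distinct_groups, bytes_per_row=1024):
    """Rule-of-thumb estimate: distinct groups times about 1K per row."""
    return distinct_groups * bytes_per_row

rows = [("a", 1), ("b", 5), ("a", 2)]
totals = hash_mode_sum(rows)

# Hypothetical workload with two million distinct groups.
estimate = estimate_hash_table_bytes(2_000_000)
estimate_gib = estimate / 2**30  # a little under 2 GiB
```

The actual per-row size depends on the key and aggregate columns, so treat the 1K figure as a starting point for the estimate rather than a measurement.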

Hash mode is very suitable when there will be only a small number of distinct groups. Sort mode is highly suitable when there is a large number of distinct groups.

Hash mode is a "blocking operation" - that is, no rows can come out of the Aggregator until all input rows have been consumed. Sort mode is not a blocking operation (except at the individual group level, which is negligible).
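
The blocking difference can be seen in a small Python sketch that logs when rows go in and when results come out. This is an analogy under assumed names (`traced`, `hash_mode`, `sort_mode`, `run`), not a description of the stage's internals.

```python
from itertools import groupby
from operator import itemgetter

def traced(rows, log):
    """Pass rows through, recording each key as it is read."""
    for row in rows:
        log.append(("in", row[0]))
        yield row

def hash_mode(rows):
    """Build the whole table first, then emit: nothing leaves early."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    yield from totals.items()

def sort_mode(rows):
    """Emit each group as soon as the next key is seen (sorted input)."""
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

def run(aggregator, rows):
    """Return the interleaved order of input reads and output rows."""
    log = []
    for key, _total in aggregator(traced(rows, log)):
        log.append(("out", key))
    return log

rows = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]
hash_log = run(hash_mode, rows)
sort_log = run(sort_mode, rows)
```

In `hash_log` every "in" entry precedes the first "out" entry. In `sort_log` the result for "a" appears as soon as the first "b" row has been read, before "c" is read at all.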