How to improve aggregator performance?

Posted: Thu Dec 10, 2009 8:54 am
by Alexander Scherbina
In my job I have to sum data in 120 columns, grouping by 10 other columns. The total number of rows to aggregate is about 5-6 million. All rows are sorted and partitioned in a Sort stage before aggregation, but performance in the Aggregator stage is still very low: only 2000-3000 rows/sec :(

I tried configuration files with 5 and 8 nodes, but this didn't significantly affect performance. Strangely, we see only 20-30% CPU usage while this job is running.

Without the Aggregator stage this job performs excellently: reading from datasets, sorting, joining, filtering, writing output to file, etc. are all very fast.

Are there any project parameters or other settings that would improve aggregation performance?

Posted: Thu Dec 10, 2009 4:19 pm
by ray.wurlod
Welcome aboard. What execution mode are you using for the Aggregator stage? What aggregation mode (sort or hash) are you using?

Posted: Fri Dec 11, 2009 2:11 am
by Alexander Scherbina
ray.wurlod wrote:Welcome aboard. What execution mode are you using for the Aggregator stage? What aggregation mode (sort or hash) are you using? ...
Execution mode is parallel. Aggregation mode is sort.

I forgot to mention that there are no warnings in the job log.

Posted: Fri Dec 11, 2009 2:42 pm
by ray.wurlod
Ignore rows/sec as a metric of performance because it is meaningless on the output of an Aggregator stage; the clock is running during all of the wait time while rows are coming in.

Posted: Tue Dec 15, 2009 7:12 am
by sridinesh2009
in aggregator stage use this option... ur performance may increase

METHOD=Hash

Posted: Tue Dec 15, 2009 7:20 am
by priyadarshikunal
sridinesh2009 wrote:in aggregator stage use this option... ur performance may increase

METHOD=Hash
Please don't use SMS/text style words as this is not a mobile phone.

As the OP said, he has 5-6 million records to aggregate, and the Hash method is only used when the number of records is small. If you set the method to hash, it starts throwing warnings when the number of records reaches the 16K mark. There are also other implications when the hash table grows beyond a certain level.
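
To make the sort versus hash trade-off concrete, here is a rough Python sketch of the two general techniques. It is only an illustration and not how the Aggregator stage is implemented; the function names and the dict-per-row layout are assumptions made for the example. The sort approach needs input already sorted on the grouping keys and holds one group at a time, while the hash approach accepts any order but keeps one entry per distinct group in memory.

```python
from itertools import groupby


def aggregate_sorted(rows, key_cols, sum_cols):
    """Sort-style aggregation (hypothetical helper).

    Assumes rows arrive already sorted on key_cols. Only the running
    totals of the current group are held in memory.
    """
    def key(row):
        return tuple(row[c] for c in key_cols)

    for group_key, group in groupby(rows, key=key):
        totals = [0] * len(sum_cols)
        for row in group:
            for i, col in enumerate(sum_cols):
                totals[i] += row[col]
        yield group_key, totals


def aggregate_hashed(rows, key_cols, sum_cols):
    """Hash-style aggregation (hypothetical helper).

    Accepts rows in any order, but the table grows by one entry
    for every distinct combination of key values.
    """
    table = {}
    for row in rows:
        group_key = tuple(row[c] for c in key_cols)
        totals = table.setdefault(group_key, [0] * len(sum_cols))
        for i, col in enumerate(sum_cols):
            totals[i] += row[col]
    return table


# Tiny made-up input, sorted on the grouping keys ("region", "year").
rows = [
    {"region": "east", "year": 2008, "sales": 10, "units": 1},
    {"region": "east", "year": 2008, "sales": 5, "units": 2},
    {"region": "east", "year": 2009, "sales": 7, "units": 3},
    {"region": "west", "year": 2009, "sales": 4, "units": 4},
]

sorted_result = list(aggregate_sorted(rows, ["region", "year"], ["sales", "units"]))
hashed_result = aggregate_hashed(rows, ["region", "year"], ["sales", "units"])
```

Both functions produce the same totals on sorted input; the difference is memory. With millions of rows and many distinct key combinations, the hash table keeps growing, whereas the sorted version never holds more than one group.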