How to improve aggregator performance?
Posted: Thu Dec 10, 2009 8:54 am
In my job I have to sum the data in 120 columns, grouping by 10 other columns. The total number of rows to aggregate is about 5-6 million. All rows are sorted and partitioned in a Sort stage before aggregation, but performance in the Aggregator stage is still very low: only 2000-3000 rows/sec.
I tried configuration files with 5 and 8 nodes, but this didn't significantly affect performance. Strangely, CPU usage is only 20-30% while the job is running.
Without the Aggregator stage, this job performs very well: reading from datasets, sorting, joining, filtering, writing to a file, etc. are all very fast.
Are there any project parameters or other settings that would improve aggregation performance?
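To make the operation concrete, here is a minimal Python sketch of what the aggregation has to do logically. It is not DataStage code and says nothing about how the Aggregator stage is implemented; the function name `aggregate_sorted` and the column names are hypothetical, chosen only for illustration. It assumes the rows arrive already sorted on the grouping columns, as in the job described above, so each group can be summed in a single streaming pass.

```python
from itertools import groupby


def aggregate_sorted(rows, key_cols, sum_cols):
    """Yield (key, totals) per group; rows must already be sorted by key_cols."""
    def key_of(row):
        # Build the composite grouping key as a tuple of the key column values.
        return tuple(row[c] for c in key_cols)

    # groupby only merges adjacent rows, which is why pre-sorted input is required.
    for key, group in groupby(rows, key=key_of):
        totals = [0] * len(sum_cols)
        for row in group:
            for i, col in enumerate(sum_cols):
                totals[i] += row[col]
        yield key, totals


# Tiny example: 2 grouping columns and 2 summed columns
# (the real job has 10 and 120 respectively).
rows = [
    {"k1": "a", "k2": 1, "v1": 1, "v2": 10},
    {"k1": "a", "k2": 1, "v1": 2, "v2": 20},
    {"k1": "b", "k2": 1, "v1": 5, "v2": 50},
]
result = list(aggregate_sorted(rows, ["k1", "k2"], ["v1", "v2"]))
```

Because the input is sorted, only one group's running totals are held in memory at a time, so memory use does not grow with the number of distinct groups.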
![Sad :(](./images/smilies/icon_sad.gif)