Requirements for Aggregator Stage

Inquisitive · Post by **Inquisitive** » Thu Jan 18, 2007 6:02 pm

Hi,

Is it mandatory to sort the incoming data when we use the Aggregator stage?

I have a scenario in which I read data from a Dataset, pass it to an Aggregator, and then write it to a Dataset. The source Dataset is hash partitioned on the GROUP BY keys, in the same order.

I am trying to process 1 million records with 6 GROUP BY keys and a sum operation on 6 other fields. Without sorting it takes 5 minutes, and with a Sort stage before the Aggregator it takes 12 minutes.
I am trying to identify why it takes so much time to process this data.

When I checked the log file, I saw that the job waits for a long time at the last entry shown below before completing.
(This happens only when I sort the data.)

Event: aggDateSales,0: Hash table has grown to 16384 entries.
Event: aggDateSales,1: Hash table has grown to 32768 entries.
Event: aggDateSales,1: Hash table has grown to 32768 entries.
Event: aggDateSales,0: Hash table has grown to 32768 entries.
Event: aggDateSales,0: Hash table has grown to 65536 entries.
Event: aggDateSales,1: Hash table has grown to 65536 entries.
Event: aggDateSales,0: Hash table has grown to 131072 entries.
Event: aggDateSales,1: Hash table has grown to 131072 entries.
Event: aggDateSales,0: Hash table has grown to 262144 entries.
Event: aggDateSales,1: Hash table has grown to 262144 entries.
Event: aggDateSales,0: Hash table has grown to 524288 entries.
Event: aggDateSales,1: Hash table has grown to 524288 entries.

Could it be because of a space issue?

And where is data temporarily stored when the Aggregator stage cannot hold all of it in memory? Is it on the resource disk mentioned in the Config file?

Note: Sorry for asking so many questions in one message. I am not trying to make it complex; I am trying to give all the information and tie it together to understand exactly how the Aggregator works and how I can tune this job.

Thanks

kcbland · Post by **kcbland** » Thu Jan 18, 2007 6:30 pm

So many questions, maybe you should number them?

1. No, not mandatory.
2. No, the hash table is the temporary file holding the summarized groups. The more distinct groups, the more space, and thus the longer the runtime, because the number of rows being stored is greater.
3. You've answered your own next question.
4. Yes, that's why you allocate a bunch of scratch disk space. It gives the illusion of keeping data in memory, but it's actually scratching and swapping.

By sorting, the hash table stays small because it only keeps the current group in memory, as opposed to temporarily holding all groups in memory. You need to weigh the benefits of sorting versus just grinding thru the data.
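As a rough illustration of the difference described above, here is a minimal Python sketch of the two approaches. It is a conceptual model only, not how DataStage implements the Aggregator; the function names and the `store`/`sales` fields are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(rows, key, value):
    """Hash method: a running total for every distinct group is held at once."""
    totals = {}
    for row in rows:
        k = row[key]
        totals[k] = totals.get(k, 0) + row[value]
    return totals


def sort_aggregate(sorted_rows, key, value):
    """Sort method: input must already be sorted on the key, so only the
    current group needs to be held before its total is emitted."""
    for k, group in groupby(sorted_rows, key=itemgetter(key)):
        yield k, sum(r[value] for r in group)


# Hypothetical sample data, unsorted on the grouping key.
rows = [
    {"store": "B", "sales": 5},
    {"store": "A", "sales": 3},
    {"store": "B", "sales": 2},
    {"store": "A", "sales": 4},
]

hash_result = hash_aggregate(rows, "store", "sales")
# The sort-based path pays for the sort up front.
sorted_rows = sorted(rows, key=itemgetter("store"))
sort_result = dict(sort_aggregate(sorted_rows, "store", "sales"))
```

Both paths produce the same totals; the trade-off is the cost of the sort against the size of the table the hash path has to keep.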

Imagine if all of the rows summarized into a single result row. Sorting would have no benefit and would just waste a lot of time. Likewise, if every input row was a unique row then sorting would waste a lot of time but the Aggregator doesn't have to build a huge hash table to hold every row. You need to profile your data and decide which methods make sense for the "average" run of the data and how much performance degradation occurs if you deviate from that. Choose the method that consistently gives you the best performance for a large range of your data profiling characteristics.
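One simple way to profile the data along these lines is to measure the ratio of distinct groups to input rows on a sample. The sketch below is an assumption about how one might do that outside DataStage; `group_ratio` and the sample fields are hypothetical names.

```python
def group_ratio(rows, key_fields):
    """Return distinct groups divided by total rows for the given grouping keys.

    A ratio near 0 means many rows collapse into few groups (small hash table);
    a ratio near 1 means almost every row is its own group (large hash table).
    """
    seen = set()
    total = 0
    for row in rows:
        seen.add(tuple(row[f] for f in key_fields))
        total += 1
    return len(seen) / total if total else 0.0


# Hypothetical sample: 4 rows falling into 3 distinct (region, day) groups.
sample = [
    {"region": "E", "day": 1, "amt": 10},
    {"region": "E", "day": 1, "amt": 20},
    {"region": "W", "day": 1, "amt": 5},
    {"region": "W", "day": 2, "amt": 7},
]
ratio = group_ratio(sample, ["region", "day"])
```

The ratio alone does not pick a method; it only tells you which end of the spectrum a typical run sits at, so timing both methods on representative data is still needed.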

Inquisitive · Post by **Inquisitive** » Thu Jan 18, 2007 7:22 pm

Thanks Kenneth,

That is a very clear explanation.

I am seeing the 1 million source records group into 700,000 records in the target.

So it seems the deciding factor is which is faster: sorting the data, or letting the Aggregator build those hash tables.

In my case I found that building the hash table is faster.

So I assume we don't have to sort the data even in the following scenario:

- Reading data from an unsorted sequential file and running the job on 4 nodes.

In the above scenario it builds up all the data in those hash tables, right?

Thanks

ray.wurlod · Post by **ray.wurlod** » Thu Jan 18, 2007 8:04 pm

Only until you exhaust available memory for the hash table, then it must spill to disk, which is quite a slowdown.
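To make the spill behaviour concrete, here is a toy Python sketch of a hash aggregation that writes partial totals to a temporary file once the in-memory table reaches a limit. It is an assumption-laden model, not DataStage's actual mechanism: `max_groups` stands in for a memory limit, and the final merge is done in memory purely for brevity, which a real implementation would avoid.

```python
import pickle
import tempfile
from collections import Counter


def aggregate_with_spill(pairs, max_groups):
    """Sum values per key, spilling partial totals to a temp file whenever
    the in-memory table already holds max_groups distinct keys."""
    table = {}
    spills = 0
    with tempfile.TemporaryFile() as scratch:
        for key, value in pairs:
            if key not in table and len(table) >= max_groups:
                pickle.dump(table, scratch)  # extra disk write: the slow part
                table = {}
                spills += 1
            table[key] = table.get(key, 0) + value

        # Merge the spilled partial totals with what is still in memory.
        result = Counter(table)
        scratch.seek(0)
        for _ in range(spills):
            result.update(pickle.load(scratch))  # extra disk read
    return dict(result), spills


# Hypothetical input: key "a" appears before and after a spill.
pairs = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
totals_small, spills_small = aggregate_with_spill(pairs, max_groups=2)
totals_large, spills_large = aggregate_with_spill(pairs, max_groups=10)
```

With a limit of 2 the table spills once and the same key ends up with partial totals in two places that must be merged later; with a limit of 10 nothing touches disk. The totals are identical either way, only the amount of extra I/O differs.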