Is there any limit for aggregator stage in handling rows
Every time I run the job it fails with the message 'aggregator terminating abnormally'. Is this because of the huge input volume?
I am not able to process a large input volume of 3 lakh (300,000) records; for now I split the input to work around this.
Is there any other way to handle the situation? The Aggregator stage works fine with input volumes below 2 lakh (200,000).
Is there a limit on the input the Aggregator stage can handle, and if so, how much?
thanks in advance,
Hi VasanthRm,
I don't think there is any limit on the number of rows the Aggregator stage can handle. However, the data you supply to the Aggregator must be sorted, and that sort order must be specified in the Aggregator.
Could you please let me know whether you are sorting the data before you send it to the Aggregator?
Thanks,
Naveen
Hmmm... a lakh is 10-man, which is 0x186A0. Very clear...
As Clarcombe mentioned, if your data is sorted then the Aggregator stage doesn't need to load and keep every record in memory; it only needs to compute values until the next group-level change.
Using sorted input is the best and most efficient method.
The limit that the aggregator stage has is memory. You can roughly estimate that (on unsorted data) the whole data stream needs to be kept in virtual memory until the last row has been read. Check your data size and look at your ulimit value and it will probably be quite clear why your process is aborting.
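The point above can be checked from the command line. This is a minimal sketch, assuming the Aggregator's feed is a flat file (the filename and the tiny two-row stand-in data are hypothetical); it compares the file size against the process virtual-memory limit that `ulimit -v` reports:

```shell
# Rough check: would an unsorted aggregation of this file fit in
# virtual memory? With unsorted input, the whole stream must be held
# until the last row is read.
datafile=input.txt
printf 'a,1\nb,2\n' > "$datafile"   # tiny stand-in for the real feed
bytes=$(wc -c < "$datafile")        # total data size in bytes
vmem_kb=$(ulimit -v)                # 'unlimited' or a figure in KB
echo "data: ${bytes} bytes, ulimit -v: ${vmem_kb}"
```

If the data size approaches the `ulimit` figure, the abort ArndW describes is the expected outcome.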
As noted, there is a limit on the number of rows the Aggregator can handle. That limit isn't a fixed row count, of course; the total size of the rows being aggregated is more the issue.
The only way I've found to successfully agg millions of rows is by presorting them. Then it only needs to keep one 'sort group' in memory at a time and can push rows through when there is a change.
But don't think the Sort stage can handle millions of rows either! You generally need to fall back on command line sort options or database sorts to accomplish that task for you.
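A command-line pre-sort like the one suggested above might look like this. It's a sketch, assuming a comma-delimited file whose first column is the grouping key (the filenames and sample rows are made up):

```shell
# Pre-sort a delimited file on its grouping column so the Aggregator
# only ever holds one sort group in memory at a time.
printf 'b,2\na,1\nb,3\n' > unsorted.csv   # stand-in data
sort -t, -k1,1 unsorted.csv > sorted.csv  # -t, = comma delimiter,
                                          # -k1,1 = sort on field 1 only
cat sorted.csv
```

After this, the Aggregator's input link must still be told the data arrive sorted on that column, as discussed later in the thread.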
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Thanks for the info.
But is there any limit for the Sort stage? It is quite logical that sorted data can be better handled by the Aggregator.
To add to that:
Can I use a Hashed File stage if my requirement is to sort the data, eliminate the duplicates, and keep the first incoming record? Which algorithm in the Hashed File stage would suit this requirement?
Any ideas? This seems very interesting.
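For what it's worth, the "keep the first incoming record per key" behaviour asked about here can also be sketched outside DataStage. This is a hedged command-line stand-in, not the Hashed File stage itself (the filenames and data are hypothetical, and the key is assumed to be column 1):

```shell
# Keep only the first record seen for each key (column 1), preserving
# arrival order. Later duplicates of the same key are dropped, which
# mirrors a "first write wins" hashed-file load.
printf 'a,1\nb,2\na,9\n' > feed.csv
awk -F, '!seen[$1]++' feed.csv > dedup.csv  # prints a line only the
                                            # first time its key appears
cat dedup.csv
```

Note that `sort -u` would also de-duplicate, but it does not guarantee the *first arriving* record survives; the `awk` form does.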
You can use a UV stage (or any other database stage) to sort your data.
If the data are in a file, however, you may find it easier - and faster - to use the UNIX sort command (perhaps as a before/stage subroutine) to effect the sort. Sort by the grouping column(s).
You must then inform the Aggregator stage - on its input link - that the data are sorted by the grouping column(s).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.