aggregator performance

Posted: Thu Mar 12, 2009 4:00 am
by dnat
I am using an Aggregator stage just to count the number of rows coming from a particular link.

The design is like this:

Seq file-->transformer-->aggregator-->seq file

Here I need the Aggregator to count the total rows from the Transformer (the key is the same for all the records), so all the data would pass through only one partition.

I am dealing with millions of records. We are still in development, but I wanted to know how this would affect performance. Is there any other way to do this?

Posted: Thu Mar 12, 2009 4:52 am
by bkumar103
Are you writing just the record count to the output Sequential file?
If yes, then you can use wc -l < inputfilename > outputfilename to get the count.
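
For anyone who wants the same count outside the shell, here is a minimal Python sketch of what wc -l does, namely counting newline characters. The function name `count_lines` and the chunk size are assumptions made for illustration; nothing here is DataStage-specific.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, which is what `wc -l` reports."""
    total = 0
    # Read in binary chunks so a file with millions of records
    # is never loaded into memory all at once.
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

Like wc -l, this does not count a final record that lacks a trailing newline.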

Posted: Thu Mar 12, 2009 3:34 pm
by ray.wurlod
Do you really need the count as a separate operation? Why not calculate it as you are processing the actual file?

Posted: Thu Mar 12, 2009 7:49 pm
by sjaladurgam
I experienced the same issue. I tried using two Aggregator stages, the first with hash partitioning and the second in sequential mode, and that works fantastically.

Just try this.

Thanks.

Posted: Thu Mar 12, 2009 11:40 pm
by sima79
Use one Aggregator stage (execution mode parallel) to count the rows in parallel, then another Aggregator stage (execution mode sequential) to sum up the counts from each partition. There is no need to use hash partitioning; round robin would be better in this case.
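
The two-stage idea can be illustrated outside DataStage with a small Python sketch. This is not DataStage code: the partitions are simulated as in-memory lists, and the function names `round_robin_partition`, `count_per_partition` and `total_count` are hypothetical.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions in turn, as round robin partitioning does."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def count_per_partition(partitions):
    # First stage (parallel): each partition produces its own row count.
    return [len(partition) for partition in partitions]


def total_count(partial_counts):
    # Second stage (sequential): SUM the partial counts into one total.
    return sum(partial_counts)


rows = [f"record-{n}" for n in range(10)]
partials = count_per_partition(round_robin_partition(rows, 4))
print(partials, total_count(partials))
```

Note that the second stage sums the partial counts. Counting rows again in the second stage would return the number of partitions, not the number of records.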

Posted: Fri Mar 13, 2009 12:49 am
by dnat
sima and sjaladurgam

So the two Aggregator stages would not hinder performance when processing millions of records? I am just worried since the data volume is very large. Anyway, thanks for your input.

Ray, I am not sure how we can calculate it during the actual processing, because I have to calculate without partitioning anyway to get the total count.

Posted: Fri Mar 13, 2009 3:19 am
by Sainath.Srinivasan
sjaladurgam wrote:...and second one with sequential ...

Posted: Fri Mar 13, 2009 5:32 am
by dnat
I set the first Aggregator to round robin and the next one to sequential mode, but the output is not correct.

The first Aggregator shows a collection type.

Posted: Fri Mar 13, 2009 6:03 am
by dnat
The first Aggregator was showing a collection type because it was in sequential mode. I changed it to parallel mode with round robin partitioning. The second Aggregator is in sequential mode. But it is still not giving the correct output.

Posted: Fri Mar 13, 2009 7:13 am
by Sainath.Srinivasan
What do you mean by "not giving correct data"?

Unless you share the results, it is not even possible to guess what is happening differently.