Count the sum of of records in a particular column

Hope · Post by **Hope** » Sun May 13, 2007 3:04 pm

I have a file with 50 million records.I want to calculate the sum of rows in a particular column.I can do this aggregator but as per my knowledge I know that I can do that based on key .How can I calculate the sum 50 million records which has no unique key. I have to calculate the sum of all the records in these 45 columns and write the total in to a seperate file.

column1 :$xxxx(sum of all the records in column1)
Coulmn2 :$yyyy(sum of all the records in column2)
Column3 :$zzzz(sum of all the records in column3)

ray.wurlod · Post by **ray.wurlod** » Sun May 13, 2007 4:16 pm

Welcome aboard. :D

It works the same way as in SQL - if there's no grouping key you get one row containing the results of the aggregate functions.

Hope · Post by **Hope** » Sun May 13, 2007 8:55 pm

Could you please be more specific?.My source is SQl and target is a flat file.I have to extract the data in to the flat file and sum the records on some particular columns.The data should be written on one flat file and the sum of records on columns should be written to 1 more file.I am able write the data on to the flat file.But I am unable to do the sum.How do I achieve this?.Is there any function I can use in transformer to calculate the sum of all the records or can I do this in using Aggregator with out mentioning a group key?.

chulett · Post by **chulett** » Sun May 13, 2007 9:31 pm

Hope wrote:or can I do this in using Aggregator with out mentioning a group key?.

That way.

ray.wurlod · Post by **ray.wurlod** » Sun May 13, 2007 11:02 pm

As I said, same rules as SQL. The Aggregator stage does not have Group Key as a mandatory property.

Hope · Post by **Hope** » Mon May 14, 2007 6:48 am

When I try to remove the Group key.I get an error on the aggregator stage saying that Group key not determined.Can you please tell me how can I get the sum of columns with out the Group key?

Maveric · Post by **Maveric** » Mon May 14, 2007 7:44 am

Not sure abt the group key in aggregator stage. You can probably generate a dummy column in the querry and populate it with any value like "A" or "1". And in the aggregator stage group by the dummy column. This is more like a work around if nothing else works.

rafik2k · Post by **rafik2k** » Mon May 14, 2007 7:54 am

In output link in aggregator stage, what devivation value you have?
You don't need any group by clause

You will have value like this

Code: Select all

OUTPUT_column_name		derivation
C1				SUM(INPUT COL1)
C2				SUM(INPUT COL2)
C3				SUM(INPUT COL2)

thompsonp · Post by **thompsonp** » Mon May 14, 2007 8:15 am

The behaviour of the Aggregator in a Server job is different to that in a Parallel job.

In Server you can choose not to have any grouping keys.
In a Parallel job you must provide one.

You could try both and see which is more performant. Use the workaround described above in the parallel job i.e. grouping by a dummy field.

As you have quite a lot of data you might choose to use two aggregators. Run the first as a parallel stage and partition round robin, then pass the results to a second aggregator that runs sequentially to give you a single output row.

As an aside you will find that a transformer, or better still a build op, runs much quicker if you are aggregating fields that are not doubles. This is down to the datatype conversion that the aggregator performs.