Count the sum of of records in a particular column
Moderators: chulett, rschirm, roy
Count the sum of of records in a particular column
I have a file with 50 million records.I want to calculate the sum of rows in a particular column.I can do this aggregator but as per my knowledge I know that I can do that based on key .How can I calculate the sum 50 million records which has no unique key. I have to calculate the sum of all the records in these 45 columns and write the total in to a seperate file.
column1 :$xxxx(sum of all the records in column1)
Coulmn2 :$yyyy(sum of all the records in column2)
Column3 :$zzzz(sum of all the records in column3)
column1 :$xxxx(sum of all the records in column1)
Coulmn2 :$yyyy(sum of all the records in column2)
Column3 :$zzzz(sum of all the records in column3)
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Could you please be more specific?.My source is SQl and target is a flat file.I have to extract the data in to the flat file and sum the records on some particular columns.The data should be written on one flat file and the sum of records on columns should be written to 1 more file.I am able write the data on to the flat file.But I am unable to do the sum.How do I achieve this?.Is there any function I can use in transformer to calculate the sum of all the records or can I do this in using Aggregator with out mentioning a group key?.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
In output link in aggregator stage, what devivation value you have?
You don't need any group by clause
You will have value like this
You don't need any group by clause
You will have value like this
Code: Select all
OUTPUT_column_name derivation
C1 SUM(INPUT COL1)
C2 SUM(INPUT COL2)
C3 SUM(INPUT COL2)
The behaviour of the Aggregator in a Server job is different to that in a Parallel job.
In Server you can choose not to have any grouping keys.
In a Parallel job you must provide one.
You could try both and see which is more performant. Use the workaround described above in the parallel job i.e. grouping by a dummy field.
As you have quite a lot of data you might choose to use two aggregators. Run the first as a parallel stage and partition round robin, then pass the results to a second aggregator that runs sequentially to give you a single output row.
As an aside you will find that a transformer, or better still a build op, runs much quicker if you are aggregating fields that are not doubles. This is down to the datatype conversion that the aggregator performs.
In Server you can choose not to have any grouping keys.
In a Parallel job you must provide one.
You could try both and see which is more performant. Use the workaround described above in the parallel job i.e. grouping by a dummy field.
As you have quite a lot of data you might choose to use two aggregators. Run the first as a parallel stage and partition round robin, then pass the results to a second aggregator that runs sequentially to give you a single output row.
As an aside you will find that a transformer, or better still a build op, runs much quicker if you are aggregating fields that are not doubles. This is down to the datatype conversion that the aggregator performs.