Count the sum of of records in a particular column

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Hope
Participant
Posts: 97
Joined: Sun May 13, 2007 2:51 pm
Contact:

Count the sum of of records in a particular column

Post by Hope »

I have a file with 50 million records.I want to calculate the sum of rows in a particular column.I can do this aggregator but as per my knowledge I know that I can do that based on key .How can I calculate the sum 50 million records which has no unique key. I have to calculate the sum of all the records in these 45 columns and write the total in to a seperate file.

column1 :$xxxx(sum of all the records in column1)
Coulmn2 :$yyyy(sum of all the records in column2)
Column3 :$zzzz(sum of all the records in column3)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard. :D

It works the same way as in SQL - if there's no grouping key you get one row containing the results of the aggregate functions.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hope
Participant
Posts: 97
Joined: Sun May 13, 2007 2:51 pm
Contact:

Post by Hope »

Could you please be more specific?.My source is SQl and target is a flat file.I have to extract the data in to the flat file and sum the records on some particular columns.The data should be written on one flat file and the sum of records on columns should be written to 1 more file.I am able write the data on to the flat file.But I am unable to do the sum.How do I achieve this?.Is there any function I can use in transformer to calculate the sum of all the records or can I do this in using Aggregator with out mentioning a group key?.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Hope wrote:or can I do this in using Aggregator with out mentioning a group key?.
That way. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

As I said, same rules as SQL. The Aggregator stage does not have Group Key as a mandatory property.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hope
Participant
Posts: 97
Joined: Sun May 13, 2007 2:51 pm
Contact:

Post by Hope »

When I try to remove the Group key.I get an error on the aggregator stage saying that Group key not determined.Can you please tell me how can I get the sum of columns with out the Group key?
Maveric
Participant
Posts: 388
Joined: Tue Mar 13, 2007 1:28 am

Post by Maveric »

Not sure abt the group key in aggregator stage. You can probably generate a dummy column in the querry and populate it with any value like "A" or "1". And in the aggregator stage group by the dummy column. This is more like a work around if nothing else works.
rafik2k
Participant
Posts: 182
Joined: Wed Nov 23, 2005 1:36 am
Location: Sydney

Post by rafik2k »

In output link in aggregator stage, what devivation value you have?
You don't need any group by clause

You will have value like this

Code: Select all

OUTPUT_column_name		derivation
C1				SUM(INPUT COL1)
C2				SUM(INPUT COL2)
C3				SUM(INPUT COL2)
thompsonp
Premium Member
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

The behaviour of the Aggregator in a Server job is different to that in a Parallel job.

In Server you can choose not to have any grouping keys.
In a Parallel job you must provide one.

You could try both and see which is more performant. Use the workaround described above in the parallel job i.e. grouping by a dummy field.

As you have quite a lot of data you might choose to use two aggregators. Run the first as a parallel stage and partition round robin, then pass the results to a second aggregator that runs sequentially to give you a single output row.

As an aside you will find that a transformer, or better still a build op, runs much quicker if you are aggregating fields that are not doubles. This is down to the datatype conversion that the aggregator performs.
Post Reply