Sum, Count on varchar in Aggregator
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
Sum, Count on varchar in Aggregator
Hi All,
I want to calculate SUM and COUNT on varchar field in Aggregator.
I have searched many posts and found out that "Preserve Type" option is present but it works only for MIN and MAX functions.
Can any one suggest how to accomplish this ?
I want to calculate SUM and COUNT on varchar field in Aggregator.
I have searched many posts and found out that "Preserve Type" option is present but it works only for MIN and MAX functions.
Can any one suggest how to accomplish this ?
Thanx and Regards,
ETL User
ETL User
-
- Participant
- Posts: 134
- Joined: Tue Jun 15, 2010 2:10 am
- Location: Bangalore
Re: Sum, Count on varchar in Aggregator
It works for even COUNT but am not sure of SUM.
N.Srinivas
India.
India.
Hi Chandra,
It is impossible to perform a count over a field in DataStage, you can only count rows and the result is always a double.
It is generally impossible to determine the sum of a character string, except the string can be implicitly converted to a numeric value. The result of generating a sum output-column in DataStage is by default a double as well. You may choose to switch this to decimal with a defined precision and scale. But the result of a column derived by sum or count will never be a Varchar. You'll have to do that manually afterwards if needed.
It is impossible to perform a count over a field in DataStage, you can only count rows and the result is always a double.
It is generally impossible to determine the sum of a character string, except the string can be implicitly converted to a numeric value. The result of generating a sum output-column in DataStage is by default a double as well. You may choose to switch this to decimal with a defined precision and scale. But the result of a column derived by sum or count will never be a Varchar. You'll have to do that manually afterwards if needed.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
There are the grateful those are happy." Francis Bacon
-
- Participant
- Posts: 134
- Joined: Tue Jun 15, 2010 2:10 am
- Location: Bangalore
Re: Sum, Count on varchar in Aggregator
My mistake I mean COUNT on the rows only,but not on a field.
N.Srinivas
India.
India.
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
Thanks for the reply.
Actually the SUM is on a decimal field but the group column is a varchar(like a name). I have tried a lot of things but getting errors
For the above error too, I have not found a specific solution , only round way solution.
@Sreenivas
Actually the SUM is on a decimal field but the group column is a varchar(like a name). I have tried a lot of things but getting errors
Code: Select all
Error when checking operator: It is not legal to set
a reducer (AGE_GRP) to be the same as a key field,
when the type of that field (int32) is not a double.
@Sreenivas
Even for COUNT I have to specify a column in Aggregator, that too is a varchar column.My mistake I mean COUNT on the rows only,but not on a field.
Thanx and Regards,
ETL User
ETL User
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Why don't you show us your exact working? Grouping columns, calculation method, output column names and data types, and calculation functions.
It is not permitted to sum a VarChar. Nor is it meaningful. Convert to appropriate numeric data type upstream of the Aggregator if the string contains only numbers.
It is not permitted to sum a VarChar. Nor is it meaningful. Convert to appropriate numeric data type upstream of the Aggregator if the string contains only numbers.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
Sorry Ray, not able to view your whole response. But got the idea so here it is; My data is like
O/P data will look like
Now I have tried two methods.
1) Using only one Aggregator,
Group = AgeGroup
Group Type = Calculation
Column for Calculation = AgeGroup
Non Missing Values Count O/P Column = CNT
Column for Calculation = Balance
Sum O/P column = SumOfBalance
Here I am calculating the count and sum in a single aggregator and the grouping is on AgeGroup(VARCHAR 10).
2) Using two aggegators, one for calculation count and other for summation(using fork join)
In both the methods I am getting error which I have mentioned above.
I know a round way solution but dont want to use it. Can there be a direct solution ?
Code: Select all
AgeGroup Balance
18-20 100.0
18-20 200.0
20-25 1000.0
20-25 2000.0
20-25 3000.0
30-35 300.0
Code: Select all
AgeGroup Count SumOfBalance
18-20 2 300.0
20-25 3 6000.0
30-35 1 300.0
1) Using only one Aggregator,
Group = AgeGroup
Group Type = Calculation
Column for Calculation = AgeGroup
Non Missing Values Count O/P Column = CNT
Column for Calculation = Balance
Sum O/P column = SumOfBalance
Here I am calculating the count and sum in a single aggregator and the grouping is on AgeGroup(VARCHAR 10).
2) Using two aggegators, one for calculation count and other for summation(using fork join)
In both the methods I am getting error which I have mentioned above.
I know a round way solution but dont want to use it. Can there be a direct solution ?
Thanx and Regards,
ETL User
ETL User
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
Hi Chandra,
Below is the design:-
Seqfile----> ColGen---->Aggregator--->SeqFile
Generate DUMMY column with value 1.
In Aggregator perform below derivation:-
Grouping Key:- AgeGroup
Aggregation Type :- Calculation
Column For calculation:- DUMMY
Sum Output Column:- Count
Column For Calculation:- Balance
Sum output column:- SumofBalance
And output you will get as :-
Below is the design:-
Seqfile----> ColGen---->Aggregator--->SeqFile
Generate DUMMY column with value 1.
In Aggregator perform below derivation:-
Grouping Key:- AgeGroup
Aggregation Type :- Calculation
Column For calculation:- DUMMY
Sum Output Column:- Count
Column For Calculation:- Balance
Sum output column:- SumofBalance
And output you will get as :-
Code: Select all
18-20,2, 0300.00
20-25,3, 6000.00
30-35,1, 0300.00
Thanks
Prasoon
ETL Consultant
LinkedIn :- http://www.linkedin.com/profile/view?id ... ab_pro_top
Blog:- http://dsshar.blogspot.com/
Prasoon
ETL Consultant
LinkedIn :- http://www.linkedin.com/profile/view?id ... ab_pro_top
Blog:- http://dsshar.blogspot.com/
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
@Zulfi/Prasson,
Thanks for your replies. I got what you are suggesting, but as I said these are round way solutions and I wanted a straight forward solution like enabling some properties or using Environmental variable.
I have tried one more way where I generate a column "CODE" and for every different age group the value is incremented by 1.
My source is a database so I can use it in my query.
And finally use this column for SUM/COUNT.
But I realize now that this is the only feasible solution possible. So thanks guys for your suggestions.
Thanks for your replies. I got what you are suggesting, but as I said these are round way solutions and I wanted a straight forward solution like enabling some properties or using Environmental variable.
I have tried one more way where I generate a column "CODE" and for every different age group the value is incremented by 1.
My source is a database so I can use it in my query.
And finally use this column for SUM/COUNT.
But I realize now that this is the only feasible solution possible. So thanks guys for your suggestions.
Thanx and Regards,
ETL User
ETL User
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
Suppose you have 10 million records (hypothetical example)chandra.shekhar@tcs.com wrote:I have tried one more way where I generate a column "CODE" and for every different age group the value is incremented by 1.
My source is a database so I can use it in my query.
And finally use this column for SUM
And if your database resides on remote server you are adding 40,00,000 bytes to the network I/O assuming you have typed it as integer which could have been avoided if a colum generator was used to generate the same column also if you have multiple stages between the database stage and aggregator this column adds to the data volume which could be avoided.
- Zulfi
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
I am dealing here with 520 million records
. I join two sets of 260 million records each and then pass it to Aggregator. My minimum throughput I get is around 200000 rows/sec.
But my database is not on a remote server and my job finishes within 30 mins. Performance is not an issue for me.
But the solution you have provided is only for counting the rows. I want AgeGroup wise summation also. If I keep a static DUMMY value then Sum is not possible so that's why for each AgeGroup the dummy column should have different value.
![Wink :wink:](./images/smilies/icon_wink.gif)
But my database is not on a remote server and my job finishes within 30 mins. Performance is not an issue for me.
But the solution you have provided is only for counting the rows. I want AgeGroup wise summation also. If I keep a static DUMMY value then Sum is not possible so that's why for each AgeGroup the dummy column should have different value.
Thanx and Regards,
ETL User
ETL User
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Wrong. You can sum(column) and sum(dummy) in the same Aggregator using the same grouping. If your data are sorted by the grouping fields and you specify Group as the aggregation mode, then your job will be maximally performant.
sum(dummy) is the equivalent of COUNT if dummy=1
sum(dummy) is the equivalent of COUNT if dummy=1
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.