Page 1 of 2

Sum, Count on varchar in Aggregator

Posted: Thu Jan 31, 2013 11:37 pm
by chandra.shekhar@tcs.com
Hi All,
I want to calculate SUM and COUNT on varchar field in Aggregator.
I have searched many posts and found out that "Preserve Type" option is present but it works only for MIN and MAX functions.
Can any one suggest how to accomplish this ?

Re: Sum, Count on varchar in Aggregator

Posted: Fri Feb 01, 2013 2:43 am
by srinivas.nettalam
It works for even COUNT but am not sure of SUM.

Posted: Fri Feb 01, 2013 3:17 am
by BI-RMA
Hi Chandra,

It is impossible to perform a count over a field in DataStage, you can only count rows and the result is always a double.

It is generally impossible to determine the sum of a character string, except the string can be implicitly converted to a numeric value. The result of generating a sum output-column in DataStage is by default a double as well. You may choose to switch this to decimal with a defined precision and scale. But the result of a column derived by sum or count will never be a Varchar. You'll have to do that manually afterwards if needed.

Re: Sum, Count on varchar in Aggregator

Posted: Fri Feb 01, 2013 3:32 am
by srinivas.nettalam
My mistake I mean COUNT on the rows only,but not on a field.

Posted: Fri Feb 01, 2013 3:54 am
by chandra.shekhar@tcs.com
Thanks for the reply.
Actually the SUM is on a decimal field but the group column is a varchar(like a name). I have tried a lot of things but getting errors

Code: Select all

Error when checking operator: It is not legal to set
a reducer (AGE_GRP) to be the same as a key field,
when the type of that field (int32) is not a double.
For the above error too, I have not found a specific solution , only round way solution.
@Sreenivas
My mistake I mean COUNT on the rows only,but not on a field.
Even for COUNT I have to specify a column in Aggregator, that too is a varchar column.

Posted: Fri Feb 01, 2013 4:30 am
by ray.wurlod
Why don't you show us your exact working? Grouping columns, calculation method, output column names and data types, and calculation functions.

It is not permitted to sum a VarChar. Nor is it meaningful. Convert to appropriate numeric data type upstream of the Aggregator if the string contains only numbers.

Posted: Fri Feb 01, 2013 4:54 am
by chandra.shekhar@tcs.com
Sorry Ray, not able to view your whole response. But got the idea so here it is; My data is like

Code: Select all

AgeGroup Balance
18-20    100.0
18-20    200.0
20-25    1000.0
20-25    2000.0
20-25    3000.0
30-35    300.0
O/P data will look like

Code: Select all

AgeGroup Count SumOfBalance
18-20     2    300.0
20-25     3    6000.0
30-35     1    300.0
Now I have tried two methods.
1) Using only one Aggregator,
Group = AgeGroup
Group Type = Calculation
Column for Calculation = AgeGroup
Non Missing Values Count O/P Column = CNT
Column for Calculation = Balance
Sum O/P column = SumOfBalance

Here I am calculating the count and sum in a single aggregator and the grouping is on AgeGroup(VARCHAR 10).

2) Using two aggegators, one for calculation count and other for summation(using fork join)

In both the methods I am getting error which I have mentioned above.
I know a round way solution but dont want to use it. Can there be a direct solution ?

Posted: Sat Feb 02, 2013 1:48 pm
by zulfi123786
how about having a column generator to generate a column with '1' as its value and sum it up for record count ?

Posted: Mon Feb 04, 2013 1:08 am
by chandra.shekhar@tcs.com
@Zulfi
Can you elaborate a bit ?

Posted: Mon Feb 04, 2013 3:47 am
by prasson_ibm
Hi Chandra,

Below is the design:-

Seqfile----> ColGen---->Aggregator--->SeqFile

Generate DUMMY column with value 1.

In Aggregator perform below derivation:-

Grouping Key:- AgeGroup

Aggregation Type :- Calculation

Column For calculation:- DUMMY
Sum Output Column:- Count
Column For Calculation:- Balance
Sum output column:- SumofBalance

And output you will get as :-

Code: Select all

18-20,2, 0300.00
20-25,3, 6000.00
30-35,1, 0300.00

Posted: Mon Feb 04, 2013 11:48 am
by zulfi123786
In the column generator go to the colum options for the generated column and select type as cycle with initial value 1 and increment 0. This will ensure every record has the same 1 for this colum and as posted sum it up in the aggregator

Posted: Tue Feb 05, 2013 12:02 am
by chandra.shekhar@tcs.com
@Zulfi/Prasson,
Thanks for your replies. I got what you are suggesting, but as I said these are round way solutions and I wanted a straight forward solution like enabling some properties or using Environmental variable.
I have tried one more way where I generate a column "CODE" and for every different age group the value is incremented by 1.
My source is a database so I can use it in my query.
And finally use this column for SUM/COUNT.
But I realize now that this is the only feasible solution possible. So thanks guys for your suggestions.

Posted: Tue Feb 05, 2013 12:50 pm
by zulfi123786
chandra.shekhar@tcs.com wrote:I have tried one more way where I generate a column "CODE" and for every different age group the value is incremented by 1.
My source is a database so I can use it in my query.
And finally use this column for SUM
Suppose you have 10 million records (hypothetical example)
And if your database resides on remote server you are adding 40,00,000 bytes to the network I/O assuming you have typed it as integer which could have been avoided if a colum generator was used to generate the same column also if you have multiple stages between the database stage and aggregator this column adds to the data volume which could be avoided.

Posted: Tue Feb 05, 2013 11:45 pm
by chandra.shekhar@tcs.com
I am dealing here with 520 million records :wink: . I join two sets of 260 million records each and then pass it to Aggregator. My minimum throughput I get is around 200000 rows/sec.
But my database is not on a remote server and my job finishes within 30 mins. Performance is not an issue for me.
But the solution you have provided is only for counting the rows. I want AgeGroup wise summation also. If I keep a static DUMMY value then Sum is not possible so that's why for each AgeGroup the dummy column should have different value.

Posted: Tue Feb 05, 2013 11:53 pm
by ray.wurlod
Wrong. You can sum(column) and sum(dummy) in the same Aggregator using the same grouping. If your data are sorted by the grouping fields and you specify Group as the aggregation mode, then your job will be maximally performant.

sum(dummy) is the equivalent of COUNT if dummy=1