Sum, Count on varchar in Aggregator

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Sum, Count on varchar in Aggregator

Post by chandra.shekhar@tcs.com »

Hi All,
I want to calculate SUM and COUNT on varchar field in Aggregator.
I have searched many posts and found out that "Preserve Type" option is present but it works only for MIN and MAX functions.
Can any one suggest how to accomplish this ?
Thanx and Regards,
ETL User
srinivas.nettalam
Participant
Posts: 134
Joined: Tue Jun 15, 2010 2:10 am
Location: Bangalore

Re: Sum, Count on varchar in Aggregator

Post by srinivas.nettalam »

It works for even COUNT but am not sure of SUM.
N.Srinivas
India.
BI-RMA
Premium Member
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Hi Chandra,

It is impossible to perform a count over a field in DataStage, you can only count rows and the result is always a double.

It is generally impossible to determine the sum of a character string, except the string can be implicitly converted to a numeric value. The result of generating a sum output-column in DataStage is by default a double as well. You may choose to switch this to decimal with a defined precision and scale. But the result of a column derived by sum or count will never be a Varchar. You'll have to do that manually afterwards if needed.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
srinivas.nettalam
Participant
Posts: 134
Joined: Tue Jun 15, 2010 2:10 am
Location: Bangalore

Re: Sum, Count on varchar in Aggregator

Post by srinivas.nettalam »

My mistake I mean COUNT on the rows only,but not on a field.
N.Srinivas
India.
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Thanks for the reply.
Actually the SUM is on a decimal field but the group column is a varchar(like a name). I have tried a lot of things but getting errors

Code: Select all

Error when checking operator: It is not legal to set
a reducer (AGE_GRP) to be the same as a key field,
when the type of that field (int32) is not a double.
For the above error too, I have not found a specific solution , only round way solution.
@Sreenivas
My mistake I mean COUNT on the rows only,but not on a field.
Even for COUNT I have to specify a column in Aggregator, that too is a varchar column.
Thanx and Regards,
ETL User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Why don't you show us your exact working? Grouping columns, calculation method, output column names and data types, and calculation functions.

It is not permitted to sum a VarChar. Nor is it meaningful. Convert to appropriate numeric data type upstream of the Aggregator if the string contains only numbers.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Sorry Ray, not able to view your whole response. But got the idea so here it is; My data is like

Code: Select all

AgeGroup Balance
18-20    100.0
18-20    200.0
20-25    1000.0
20-25    2000.0
20-25    3000.0
30-35    300.0
O/P data will look like

Code: Select all

AgeGroup Count SumOfBalance
18-20     2    300.0
20-25     3    6000.0
30-35     1    300.0
Now I have tried two methods.
1) Using only one Aggregator,
Group = AgeGroup
Group Type = Calculation
Column for Calculation = AgeGroup
Non Missing Values Count O/P Column = CNT
Column for Calculation = Balance
Sum O/P column = SumOfBalance

Here I am calculating the count and sum in a single aggregator and the grouping is on AgeGroup(VARCHAR 10).

2) Using two aggegators, one for calculation count and other for summation(using fork join)

In both the methods I am getting error which I have mentioned above.
I know a round way solution but dont want to use it. Can there be a direct solution ?
Thanx and Regards,
ETL User
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

how about having a column generator to generate a column with '1' as its value and sum it up for record count ?
- Zulfi
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Zulfi
Can you elaborate a bit ?
Thanx and Regards,
ETL User
prasson_ibm
Premium Member
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

Hi Chandra,

Below is the design:-

Seqfile----> ColGen---->Aggregator--->SeqFile

Generate DUMMY column with value 1.

In Aggregator perform below derivation:-

Grouping Key:- AgeGroup

Aggregation Type :- Calculation

Column For calculation:- DUMMY
Sum Output Column:- Count
Column For Calculation:- Balance
Sum output column:- SumofBalance

And output you will get as :-

Code: Select all

18-20,2, 0300.00
20-25,3, 6000.00
30-35,1, 0300.00
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

In the column generator go to the colum options for the generated column and select type as cycle with initial value 1 and increment 0. This will ensure every record has the same 1 for this colum and as posted sum it up in the aggregator
- Zulfi
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Zulfi/Prasson,
Thanks for your replies. I got what you are suggesting, but as I said these are round way solutions and I wanted a straight forward solution like enabling some properties or using Environmental variable.
I have tried one more way where I generate a column "CODE" and for every different age group the value is incremented by 1.
My source is a database so I can use it in my query.
And finally use this column for SUM/COUNT.
But I realize now that this is the only feasible solution possible. So thanks guys for your suggestions.
Thanx and Regards,
ETL User
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

chandra.shekhar@tcs.com wrote:I have tried one more way where I generate a column "CODE" and for every different age group the value is incremented by 1.
My source is a database so I can use it in my query.
And finally use this column for SUM
Suppose you have 10 million records (hypothetical example)
And if your database resides on remote server you are adding 40,00,000 bytes to the network I/O assuming you have typed it as integer which could have been avoided if a colum generator was used to generate the same column also if you have multiple stages between the database stage and aggregator this column adds to the data volume which could be avoided.
- Zulfi
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

I am dealing here with 520 million records :wink: . I join two sets of 260 million records each and then pass it to Aggregator. My minimum throughput I get is around 200000 rows/sec.
But my database is not on a remote server and my job finishes within 30 mins. Performance is not an issue for me.
But the solution you have provided is only for counting the rows. I want AgeGroup wise summation also. If I keep a static DUMMY value then Sum is not possible so that's why for each AgeGroup the dummy column should have different value.
Thanx and Regards,
ETL User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Wrong. You can sum(column) and sum(dummy) in the same Aggregator using the same grouping. If your data are sorted by the grouping fields and you specify Group as the aggregation mode, then your job will be maximally performant.

sum(dummy) is the equivalent of COUNT if dummy=1
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply