Counts and sums in DataStage without the 'Aggregator' stage
Moderators: chulett, rschirm, roy
Hi all
The scenario is as below.
key1,key2,desc
-------------------
a,b,emergency_room
a,b,emergency_room
a,b,emergency_room
m,n,emergency_room
m,n,emergency_room
x,y,emergency_room
output
key1,key2,count
----------------
a,b,3
m,n,2
x,y,1
Can this be done in DataStage?
I was asked not to use the Aggregator stage or 'group by' in my SQL,
as the data volume is very huge.
Thank you,
Anu
I did not get that. You can use group by or you cannot. You can do the counts in a Transformer using stage variables. Discussed zillions of times; Search is your best friend.
Like this for example.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Thanks DSguru2B and Kumar.
In the real scenario, I have 8 columns to group by, and the data is about 500 million rows per run.
I have been trying with stage variables, but I am wondering how to retain the previous key column values to check against the incoming values each time.
I created 2 stage variables, svPrevRow and svCount, with initial values of 0.
Each incoming row is checked against svPrevRow, and if they are equal then the count is increased.
This continues until a different set of keys arrives; at that point we have to write svPrevRow and svCount to the output and do the same for the new set of keys.
But I am doing something wrong here; I am not getting the correct output. Could someone help me out with this, please?
Thanks a lot for your valuable time.
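[Editor's note: a rough Python sketch, not DataStage syntax, of the stage-variable idea described above: compare each row's key to the previous row's key and flush the count when it changes. The input must be sorted on the keys; names are illustrative.]

```python
# Sketch of the "previous key vs. incoming key" counting pattern.
# Assumes the rows arrive sorted on the key columns.
def group_counts(rows):
    """rows: (key1, key2) tuples, sorted on the keys."""
    results, prev, count = [], None, 0
    for key in rows:
        if key != prev:
            if prev is not None:
                results.append((*prev, count))   # flush the finished group
            count = 1
        else:
            count += 1
        prev = key
    if prev is not None:
        results.append((*prev, count))           # flush the last group
    return results

rows = [("a", "b"), ("a", "b"), ("a", "b"), ("m", "n"), ("m", "n"), ("x", "y")]
print(group_counts(rows))
# [('a', 'b', 3), ('m', 'n', 2), ('x', 'y', 1)]
```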
kumar_s wrote:If you need to output in this format,
a,b,3
m,n,2
x,y,1
The better way is to group by in your source SQL. If key1 and key2 are indexed, the SQL should be much more efficient.
Thank you,
Anu
DSguru2B wrote: Did you check the embedded post in my previous reply? Vincent has given the complete code on how to get it done.
Thanks Guru. I checked the embedded post and tried to implement that logic. But in my case, I should not write each row to the output; I need to count the rows in a loop and write them to the output, along with the count, only when a new set of keys arrives.
I used the same logic, but I got all the rows in the output with a count of '1' against each row.
Thank you,
Anu
There must be something you are doing wrong. Do us a favour: post a few lines of your actual file with the exact number of columns, and post your stage variable derivations. Let's diagnose where you're slipping, shall we?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Thanks again for your time, Guru.
Input
key1,key2,key3,desc
-------------------
100,200,300,er
100,200,300,er
100,200,300,er
111,222,333,er
111,222,333,er
Output should be
key1,key2,key3,count
--------------------
100,200,300,3
111,222,333,2
The input is coming from a table. There are a few other keys, along with the above three (i.e. 100, 200, 300), that make each row unique in the source table, but I need the count of each set of (100, 200, 300) records where desc = 'er'. Just assume we have 'er' as the desc for all incoming rows.
and my stage variables are as below...
sv_name sv_derivation
------------------------------------------------------------------------------------
svCnt --- If input.key1 : input.key2 : input.key3 <> svPrevRow Then 1 Else svCnt + 1
svPrevRow -- input.key1 : input.key2 : input.key3
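[Editor's note: a quick Python simulation, with hypothetical names, of the two derivations above as a Transformer would evaluate them top to bottom per row: svCnt is derived before svPrevRow is updated, so the comparison always sees the previous row's key.]

```python
# Simulate the svCnt / svPrevRow stage variables over sorted input.
def simulate(rows):
    """rows: (key1, key2, key3) tuples, sorted on the keys; returns svCnt per row."""
    sv_prev, sv_cnt, out = None, 0, []
    for key1, key2, key3 in rows:
        row_key = f"{key1}:{key2}:{key3}"
        sv_cnt = 1 if row_key != sv_prev else sv_cnt + 1   # svCnt derivation
        sv_prev = row_key                                  # svPrevRow derivation
        out.append(sv_cnt)
    return out

rows = [("100", "200", "300")] * 3 + [("111", "222", "333")] * 2
print(simulate(rows))
# [1, 2, 3, 1, 2] -- a running count per row, not one total per group
```

If every row comes out with a count of 1 instead, the rows reaching the Transformer are typically not sorted (or not partitioned) on the keys.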
DSguru2B wrote: There must be something you are doing wrong. Do us a favour: post a few lines of your actual file with the exact number of columns, and post your stage variable derivations. Let's diagnose where you're slipping, shall we?
Thank you,
Anu
What you are doing is correct, but you need to add another step. Currently, if you feed your input (the one you mentioned in your previous post) to the above logic, your output will look like
Code: Select all
key1,key2,key3,desc,Count
-------------------------
100,200,300,er,1
100,200,300,er,2
100,200,300,er,3
111,222,333,er,1
111,222,333,er,2
You need to pass it through a Remove Duplicates stage to retain the last duplicate.
Then again, you will have to process in sequential mode or partition on the keys. Also make sure your input is sorted; that is very important. The easiest way is to use the Aggregator. If I were you, I would go back to whoever put that limit on me and demand an explanation.
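[Editor's note: the two-step flow described above, a running count in the Transformer followed by a keep-last Remove Duplicates, can be sketched in Python as follows; column names are assumptions.]

```python
# Step 1: emit a running count per row (the Transformer stage variables).
def running_counts(rows):
    """rows: (key1, key2, key3) tuples, sorted on the keys."""
    out, prev, cnt = [], None, 0
    for key in rows:
        cnt = 1 if key != prev else cnt + 1   # like svCnt
        out.append((key, cnt))
        prev = key                            # like svPrevRow
    return out

# Step 2: keep only the LAST row of each key group (Remove Duplicates,
# "retain last"); that row carries the full group count.
def keep_last(counted):
    last = {}
    for key, cnt in counted:
        last[key] = cnt                       # later rows overwrite earlier ones
    return [(*k, c) for k, c in last.items()]

rows = [("100", "200", "300")] * 3 + [("111", "222", "333")] * 2
print(keep_last(running_counts(rows)))
# [('100', '200', '300', 3), ('111', '222', '333', 2)]
```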
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
I am getting output as below:
key1,key2,key3,desc, Count
-------------------------
100,200,300,er,1
100,200,300,er,1
100,200,300,er,1
111,222,333,er,1
111,222,333,er,1
They had bad experiences with the Aggregator stage in 7.5.1 PX, especially when dealing with 500 million (or more) rows in each run. I was told the output was not accurate or not consistent.
There is no way of using the Aggregator stage here.
DSguru2B wrote: What you are doing is correct, but you need to add another step. Currently, if you feed your input (the one you mentioned in your previous post) to the above logic, your output will look like
Code: Select all
key1,key2,key3,desc,Count
-------------------------
100,200,300,er,1
100,200,300,er,2
100,200,300,er,3
111,222,333,er,1
111,222,333,er,2
You need to pass it through a Remove Duplicates stage to retain the last duplicate. Then again, you will have to process in sequential mode or partition on the keys. Also make sure your input is sorted; that is very important. The easiest way is to use the Aggregator. If I were you, I would go back to whoever put that limit on me and demand an explanation.
Thank you,
Anu
I would also suggest going with the Aggregator stage; using stage variables will take almost the same time.
If there is no other way, you can go ahead with DSguru's suggestion: partition on key1, key2, and key3 and sort in the input link of the Transformer. Introduce a Remove Duplicates stage after the Transformer to retain the last record.
Cheers - Madhav
Thank you.
Is there any way that we can SUM the $ amounts too in the Transformer? For example, in the same example as above we have a 5th column, $amt. We need to SUM($amt) for each set of keys (100, 200, 300) where desc = 'er'.
Madhav_M wrote: I would also suggest going with the Aggregator stage; using stage variables will take almost the same time.
If there is no other way, you can go ahead with DSguru's suggestion: partition on key1, key2, and key3 and sort in the input link of the Transformer. Introduce a Remove Duplicates stage after the Transformer to retain the last record.
Cheers - Madhav
Thank you,
Anu
Yes, you can. Add another stage variable, say svSum. Put it after svCnt and before svPrevRow. Its derivation will be
Code: Select all
If input.key1 : input.key2 : input.key3 <> svPrevRow Then input.Amount Else svSum + input.Amount
Then again, if you are running in sequential mode it will work; otherwise you need to partition on your keys.
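[Editor's note: a hypothetical Python simulation of that svSum idea: reset the sum to the current amount on a new key, otherwise accumulate, and let the keep-last step yield one total per key group. Names are assumptions.]

```python
# Simulate the running-sum stage variable plus keep-last de-duplication.
def group_sums(rows):
    """rows: ((key1, key2, key3), amount) pairs, sorted on the keys."""
    last, prev, total = {}, None, 0.0
    for key, amt in rows:
        total = amt if key != prev else total + amt   # like svSum
        prev = key                                    # like svPrevRow
        last[key] = total       # keep-last gives the final group sum
    return last

rows = [(("100", "200", "300"), 10.0),
        (("100", "200", "300"), 5.0),
        (("111", "222", "333"), 7.5)]
print(group_sums(rows))
# {('100', '200', '300'): 15.0, ('111', '222', '333'): 7.5}
```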
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Guru:
Could you please check my post before the previous one? I am not getting the output you mentioned.
Thanks in advance.
DSguru2B wrote: Yes, you can. Add another stage variable, say svSum. Put it after svCnt and before svPrevRow. Its derivation will be
Code: Select all
If input.key1 : input.key2 : input.key3 <> svPrevRow Then input.Amount Else svSum + input.Amount
Then again, if you are running in sequential mode it will work; otherwise you need to partition on your keys.
Thank you,
Anu