Aggregator stage performance
Moderators: chulett, rschirm, roy
hi all,
my DataStage job aggregates about 450,000 records into 210 groups with 300 summed output columns.
The outcome of this aggregation is fine, but performance is very poor: the job takes 45 minutes to finish. I tried sorting the data on the grouping column before aggregating, but without success; the job time was even a few minutes longer. In an older version of this job I used a SQL GROUP BY statement within a Sybase OC stage to do the aggregation. Performance then was much better (a few minutes) than with my current job. Since I now work with Data Sets as input, the former Sybase solution cannot be used.
Is there a way to speed up the aggregation?
Thanks in advance.
Peter
If you can sort the incoming data stream to the Aggregator on the columns you use in that stage, and then specify in the Aggregator that those columns are sorted, you will see the stage perform blazingly fast. This is because it no longer needs to store data temporarily: it just sums/aggregates values and, when a group change occurs, outputs the result.
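The group-change behaviour described above can be sketched in plain Python (this is an illustration of the idea, not DataStage code): with input pre-sorted on the grouping key, each group arrives contiguously, so only one running sum needs to be held at a time.

```python
# Illustrative sketch (plain Python, not DataStage): why a sorted-input
# aggregator is fast. Each group arrives contiguously, so the aggregator
# can emit a result on every group change instead of buffering all groups.
from itertools import groupby
from operator import itemgetter

rows = [("a", 1), ("a", 2), ("b", 5), ("b", 7), ("c", 3)]  # sorted on key

def aggregate_sorted(rows):
    # Emit (key, sum) as soon as a group change occurs.
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

print(list(aggregate_sorted(rows)))  # -> [('a', 3), ('b', 12), ('c', 3)]
```

Memory use stays constant regardless of the number of groups, which is why the sorted method scales so much better than buffering 450,000 rows.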
-
- Participant
- Posts: 222
- Joined: Tue Aug 30, 2005 2:07 am
- Location: pune
- Contact:
Hi,
Adding a few more points to Arnd's solution: perform hash partitioning in the Sort stage before the Aggregator stage, on the grouping keys you use in the Aggregator. Your data will then be sorted properly on the group keys within each partition, and your job performance will increase.
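The point of hash partitioning on the grouping keys can be sketched in plain Python (an illustration of the concept, not DataStage internals): every row with the same key lands in the same partition, so each partition can sort and aggregate its groups independently.

```python
# Illustrative sketch (plain Python): hash partitioning on the grouping key.
# Rows with the same key always hash to the same partition, so no group
# is split across partitions and each partition can aggregate on its own.
def hash_partition(rows, key_index, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        # Deterministic (within one run) hash on the key column only.
        partitions[hash(row[key_index]) % n_partitions].append(row)
    return partitions

rows = [("a", 1), ("b", 5), ("a", 2), ("c", 3), ("b", 7)]
parts = hash_partition(rows, key_index=0, n_partitions=2)
# All "a" rows are in one partition, all "b" rows in one, and so on.
```

If you partitioned on anything other than the grouping keys (round-robin, for instance), a single group could be split across partitions and the per-partition aggregates would be wrong.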
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
Hi tejaswini,
tejaswini wrote: among the columns which you are aggregating, are there any decimal columns?
May I know how it makes a difference to performance if we have decimal fields in the aggregation? Can you explain, please.
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
Hi Peter,
PeterPol wrote: It doesn't seem to help (still 48 minutes), or maybe I misunderstood your advice...
I added a Sort stage before the Aggregator stage to sort input on the aggregator grouping key. In both the Sort and Aggregator stages I selected hash partitioning.
Peter
It's not required to perform hash partitioning in both the Sort stage and the Aggregator stage. Just perform hash partitioning in the Sort stage, then select the 'Same' partitioning type in the Aggregator stage. Try this out and let us know your statistics.
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
Nagesh,
If there are any decimal-type columns, you can add the environment variable APT_DECIMAL_INTERM_PRECISION and set its value to 2; by default it is 38. This limits the internal calculations to a precision of 2, which might speed up the job.
PeterPol,
It is enough to do hash partitioning in the Sort stage alone and keep the 'Same' partitioning type in the Aggregator; there is no need to do hash partitioning again in the Aggregator.
Hi Tejaswini,
tejaswini wrote: If there are any decimal-type columns, you can add the environment variable APT_DECIMAL_INTERM_PRECISION and set its value to 2; by default it is 38. This limits the internal calculations to a precision of 2, which might speed up the job.
I am sorry to say your explanation is not giving proper information. The environment variables APT_DECIMAL_INTERM_PRECISION and APT_DECIMAL_INTERM_SCALE are helpful when your data need more than the default precision of 38 and scale of 10; these variables are there to override the default values. If you reduce the precision to 2, your result will not be proper; it will be very wrong, because you are cutting down the original values.
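The danger described above can be demonstrated with Python's decimal module (an analogy for the idea of intermediate decimal precision, not DataStage's own arithmetic): capping the working precision too low corrupts running sums.

```python
# Illustrative sketch (Python's decimal module, not DataStage internals):
# why capping intermediate decimal precision too aggressively corrupts sums.
from decimal import Decimal, localcontext

values = [Decimal("123.45")] * 10  # true sum is 1234.50

with localcontext() as ctx:
    ctx.prec = 2          # analogous to a very low intermediate precision
    low = sum(values, Decimal(0))

with localcontext() as ctx:
    ctx.prec = 38         # analogous to the default precision
    high = sum(values, Decimal(0))

print(low, high)  # low is 1.2E+3 (1200) instead of the true 1234.50
```

Each intermediate addition is rounded to two significant digits, so the error compounds with every row; lowering the precision only makes sense when your data genuinely fit within the reduced precision.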
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
Recall that there were 2 steps involved. The first was sorting.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Are the data pre-sorted when they are written to the Data Sets? If so, include an explicit Sort stage ahead of the Aggregator, and specifically make the sort rule for the key columns "Don't Sort (Previously Sorted)". This will prevent DataStage from inserting tsort operators.
View the score to see these inserted operators.
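The idea behind "Don't Sort (Previously Sorted)" can be sketched in plain Python (a conceptual illustration, not DataStage code): instead of paying for a full O(n log n) re-sort, the stage trusts that the data are already ordered, which can at most be verified in a single O(n) pass.

```python
# Illustrative sketch (plain Python, not DataStage): declaring data
# "previously sorted" replaces an O(n log n) sort with, at most, a
# single O(n) sortedness check before aggregation.
def check_presorted(rows, key_index=0):
    for prev, cur in zip(rows, rows[1:]):
        if prev[key_index] > cur[key_index]:
            raise ValueError("input is not sorted on the grouping key")
    return rows  # safe to feed to a sorted-method aggregator

check_presorted([("a", 1), ("a", 2), ("b", 5)])  # passes silently
```

This is why declaring the key columns as previously sorted prevents the engine from inserting tsort operators: there is nothing left for them to do.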
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PeterPol wrote: hi
I added a sort stage and adjusted the partition type to 'Same' in the aggregator. Isn't that enough?
Peter
That's not enough. You also have to specify in the Aggregator stage that the data has already been sorted.
Also set the Method property in the Aggregator stage to "Sort", by default it is Hash.
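The difference between the two Method settings can be sketched in plain Python (an illustration of the two strategies, not DataStage code): the hash method keeps one running total per group in memory regardless of input order, while the sort method requires key-sorted input and holds only the current group.

```python
# Illustrative sketch (plain Python): the two Aggregator methods.
# "Hash": one in-memory entry per group, input order irrelevant.
# "Sort": input must be sorted on the key; only one group is live at a time.
from itertools import groupby
from operator import itemgetter

rows = [("b", 5), ("a", 1), ("b", 7), ("a", 2)]

def aggregate_hash(rows):
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

def aggregate_sort(rows):
    # Requires key-sorted input; shown here by sorting first.
    return {key: sum(v for _, v in group)
            for key, group in groupby(sorted(rows, key=itemgetter(0)),
                                      key=itemgetter(0))}

assert aggregate_hash(rows) == aggregate_sort(rows) == {"a": 3, "b": 12}
```

Both produce the same totals; the sort method just pairs naturally with the sorted, hash-partitioned input recommended earlier in the thread.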
The Bird.