
Aggregator stage performance

Posted: Wed Nov 22, 2006 3:38 am
by PeterPol
hi all,

My DataStage job aggregates about 450,000 records into 210 groups with 300 summed output columns.

The outcome of this aggregation is fine, but performance is very poor: the job takes 45 minutes to finish. I tried sorting the data on the grouping column before aggregating, but without success; the job even took a few minutes longer. In an older version of this job I used a SQL GROUP BY statement within a Sybase OC stage to do the aggregation, and performance was much better (a few minutes) than my current job. Since I now work with Data Sets as input, the former Sybase solution can no longer be used.

Is there a way to speed up the aggregation?


Thanks in advance.

Peter

Posted: Wed Nov 22, 2006 3:54 am
by ArndW
If you can sort your incoming data stream on the columns you group by in the Aggregator, and then specify in the Aggregator that the input is sorted, you will see that the stage performs blazingly fast. This is because it no longer needs to store data temporarily; it just sums/aggregates values and outputs a result whenever a group change occurs.
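A rough Python sketch (not DataStage code) of why pre-sorted aggregation is so cheap: with the input sorted on the grouping key, the aggregator only holds the running totals for the *current* group and emits a row on each key change, instead of buffering state for every group at once.

```python
# Sketch of streaming aggregation over input pre-sorted on the group key.
# Only O(1) state (the current group's accumulator) is held at any time.
from itertools import groupby
from operator import itemgetter

def aggregate_sorted(rows):
    """rows: (group_key, value) pairs already sorted on group_key."""
    result = []
    for key, grp in groupby(rows, key=itemgetter(0)):
        # groupby yields each run of equal keys exactly once, so a
        # single pass produces one summed row per group.
        result.append((key, sum(v for _, v in grp)))
    return result

rows = [("A", 1), ("A", 2), ("B", 5), ("B", 7), ("C", 3)]
print(aggregate_sorted(rows))  # [('A', 3), ('B', 12), ('C', 3)]
```

Note that `groupby` (like the Aggregator's sorted mode) silently produces wrong results if the input is *not* actually sorted on the key, which is why the sort step and the "sorted" declaration have to go together.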

Posted: Wed Nov 22, 2006 3:58 am
by Nageshsunkoji
Hi,

To add a few points to Arnd's solution: perform a hash partition in the Sort stage before the Aggregator stage, on the same grouping keys you use in the Aggregator. Your data will then be sorted properly within each partition on the group keys, and your job performance will improve.
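A minimal Python sketch (not DataStage code) of the property hash partitioning guarantees: every row with the same key hashes to the same partition, so each group can be sorted and aggregated entirely within one partition with no cross-partition merge.

```python
# Sketch of hash partitioning on a grouping key across n partitions.
def hash_partition(rows, n_partitions):
    """rows: (key, value) pairs. Returns a list of per-partition row lists."""
    parts = [[] for _ in range(n_partitions)]
    for key, value in rows:
        # The same key always maps to the same partition index.
        parts[hash(key) % n_partitions].append((key, value))
    return parts

rows = [("A", 1), ("B", 5), ("A", 2), ("C", 3), ("B", 7)]
parts = hash_partition(rows, 4)
# Every occurrence of a given key lands in exactly one partition,
# so sorting and aggregating can proceed independently per partition.
```

This is why key-based (hash) partitioning matters for a parallel aggregation: with round-robin or random partitioning, rows of one group would be scattered across partitions and partial results would have to be combined afterwards.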

Posted: Wed Nov 22, 2006 5:51 am
by tejaswini
Among the columns you are aggregating, are there any decimal columns?

Posted: Wed Nov 22, 2006 6:00 am
by Nageshsunkoji
tejaswini wrote:Among the columns you are aggregating, are there any decimal columns?
Hi tejaswini,

May I ask how having decimal fields among the aggregated columns makes a difference to performance? Could you please explain?

Posted: Wed Nov 22, 2006 6:00 am
by PeterPol
It doesn't seem to help (still 48 minutes), or maybe I misunderstood your advice...

I added a Sort stage before the Aggregator stage to sort the input on the aggregator's grouping key. In both the Sort and Aggregator stages I selected hash partitioning.

Peter

Posted: Wed Nov 22, 2006 6:04 am
by Nageshsunkoji
PeterPol wrote:It doesn't seem to help (still 48 minutes), or maybe I misunderstood your advice...

I added a Sort stage before the Aggregator stage to sort the input on the aggregator's grouping key. In both the Sort and Aggregator stages I selected hash partitioning.

Peter
Hi Peter,

It's not necessary to perform a hash partition in both the Sort stage and the Aggregator stage. Just hash-partition in the Sort stage and then select 'Same' as the partitioning type in the Aggregator stage. Try this out and let us know your statistics.

Posted: Wed Nov 22, 2006 6:07 am
by PeterPol
The aggregation key is a varchar(11). The 300 output columns in the Aggregator stage are integers and decimals (most with precision 15, scale 3).

Peter

Posted: Wed Nov 22, 2006 6:11 am
by tejaswini
Nagesh,

If there are any decimal-type columns, we can add the environment variable APT_DECIMAL_INTERM_PRECISION and set its value to 2 (by default it is 38). This limits the internal calculations to a precision of 2, which might speed up the job.

PeterPol,
It is enough to do hash partitioning in the Sort stage alone, and to keep 'Same' as the partitioning type in the Aggregator; there is no need to hash-partition again in the Aggregator.

Posted: Wed Nov 22, 2006 7:29 am
by Nageshsunkoji
tejaswini wrote:Nagesh,

If there are any decimal-type columns, we can add the environment variable APT_DECIMAL_INTERM_PRECISION and set its value to 2 (by default it is 38). This limits the internal calculations to a precision of 2, which might speed up the job.

PeterPol,
It is enough to do hash partitioning in the Sort stage alone, and to keep 'Same' as the partitioning type in the Aggregator; there is no need to hash-partition again in the Aggregator.
Hi Tejaswini,

I am sorry to say that your explanation is not accurate. The environment variables APT_DECIMAL_INTERM_PRECISION and APT_DECIMAL_INTERM_SCALE are helpful when your data need more than the default precision of 38 and scale of 10; they exist to override those defaults. If you reduce the precision to 2, your results will simply be wrong, because you are truncating the intermediate values.
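The danger can be illustrated with Python's `decimal` module standing in for DataStage's intermediate-decimal setting (this is an analogy, not DataStage behavior verbatim): cutting the intermediate precision down to 2 significant digits corrupts even a trivial sum.

```python
# Illustration: tiny intermediate precision corrupts decimal arithmetic.
from decimal import Decimal, localcontext

values = [Decimal("123.456"), Decimal("78.9")]

with localcontext() as ctx:
    ctx.prec = 2                       # analogous to forcing precision 2
    low = sum(values, Decimal(0))      # each addition rounds to 2 digits

with localcontext() as ctx:
    ctx.prec = 38                      # the comfortable default
    high = sum(values, Decimal(0))

print(low, high)  # 2.0E+2 202.356 -- the low-precision sum is wrong
```

With 300 summed decimal columns over 450,000 rows, any such rounding error would compound across every intermediate addition, which is why shrinking the intermediate precision is not a safe performance lever.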

Posted: Wed Nov 22, 2006 7:50 am
by PeterPol
Sorting the data on the aggregation key before aggregating performs no better...

Peter

Posted: Wed Nov 22, 2006 7:54 am
by ArndW
Recall that there were two steps involved. The first was sorting; the second was specifying in the Aggregator stage that the data are already sorted.

Posted: Wed Nov 22, 2006 8:04 am
by ray.wurlod
Are the data pre-sorted when they are written to the Data Sets? If so, include an explicit Sort stage ahead of the Aggregator, and specifically make the sort rule for the key columns "Don't Sort (Previously Sorted)". This will prevent DataStage from inserting tsort operators.

View the score to see these inserted operators.

Posted: Wed Nov 22, 2006 8:05 am
by PeterPol
hi

I added a Sort stage and set the partitioning type to 'Same' in the Aggregator. Isn't that enough?

Peter

Posted: Wed Nov 22, 2006 8:47 am
by thebird
PeterPol wrote:hi

I added a Sort stage and set the partitioning type to 'Same' in the Aggregator. Isn't that enough?

Peter
That's not enough. You also have to specify in the Aggregator stage that the data have already been sorted.

Also set the Method property in the Aggregator stage to "Sort"; by default it is "Hash".
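The difference between the two Method settings can be sketched in Python (a conceptual analogy, not DataStage code): the default "Hash" method keeps one accumulator per group in memory simultaneously, while the "Sort" method, valid only for pre-sorted input, keeps a single running accumulator and flushes it on every key change.

```python
# Contrast of the two aggregation strategies; both produce the same totals.
def aggregate_hash(rows):
    """Order-independent: one dict entry per group, all held at once."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

def aggregate_sort(sorted_rows):
    """Requires input sorted on key: only one accumulator is live."""
    out, cur_key, cur_sum = {}, None, 0
    for key, value in sorted_rows:
        if key != cur_key:
            if cur_key is not None:
                out[cur_key] = cur_sum   # flush the finished group
            cur_key, cur_sum = key, 0
        cur_sum += value
    if cur_key is not None:
        out[cur_key] = cur_sum           # flush the last group
    return out
```

With only 210 groups the memory saving here is modest; the real win in this job comes from the sorted mode letting the stage stream rows through instead of buffering.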

The Bird.