Aggregator stage performance

PeterPol · Post by **PeterPol** » Wed Nov 22, 2006 3:38 am

hi all,

my datastage job aggregates about 450000 records into 210 groups with 300 summed output columns.

The outcome of this aggregation is fine but performance is very poor. Running the job takes 45 minutes to finish. I tried to sort the data on the grouping column before aggregating, but without succes. The jobtime was even minutes longer. In an older version of this job I used a SQL group by_statement within sybase OC stage to do the aggregation. The performance then was much better (few minutes) than my current job. Since I now work with datasets as input, the former sybase solution cannot be used.

Is there a way to speed up the aggregation?

Thanks in advance.

Peter

ArndW · Post by **ArndW** » Wed Nov 22, 2006 3:54 am

If you can sort your incoming data stream to the aggregator on the columns you use in that stage and then specify in the aggregator that the columns are sorted you will see that the stage performs blazingly fast. This is because it no longer needs to store data temporarily, it just sums/aggregates values and when a group change occurs it outputs the result.

Nageshsunkoji · Post by **Nageshsunkoji** » Wed Nov 22, 2006 3:58 am

Hi,

Adding few more points to the Arnd solution, perofrm HASH partition in the Sort stage before aggregator stage on grouping keys that you are using in the Aggregator stage. Your data will sort properly based on group keys and your job performance will increase.

tejaswini · Post by **tejaswini** » Wed Nov 22, 2006 5:51 am

among the columns which you are aggregating, are there any decimal columns?

Nageshsunkoji · Post by **Nageshsunkoji** » Wed Nov 22, 2006 6:00 am

tejaswini wrote:among the columns which you are aggregating, are there any decimal columns?

Hi tejaswini,

May I know, how it make difference with the performance if we have the decimal fields in the aggregation key. Can you explain the same, please.

PeterPol · Post by **PeterPol** » Wed Nov 22, 2006 6:00 am

It doesn't seem to help (still 48 minutes), or may be I misunderstood your advise...

I added a sort stage before the aggregator stage to sort input on the aggregator grouping key. In both the sort and aggregator stage I selected hash partitioning.

Peter

Nageshsunkoji · Post by **Nageshsunkoji** » Wed Nov 22, 2006 6:04 am

PeterPol wrote:It doesn't seem to help (still 48 minutes), or may be I misunderstood your advise...

I added a sort stage before the aggregator stage to sort input on the aggregator grouping key. In both the sort and aggregator stage I selected hash partitioning.

Peter

Hi Peter,

Its not required to perform HASH partition in both Sort stage and Aggregator stage. Just perform HASH partition in the sort stage and then select partitioning type as 'SAME' in the aggregator stage. Just try out this and let us know u r statistics.

PeterPol · Post by **PeterPol** » Wed Nov 22, 2006 6:07 am

The aggregation key is a varchar(11). The 300 output columns in the aggregator stage are integers, and decimals (most with precision: 15,3)

Peter

tejaswini · Post by **tejaswini** » Wed Nov 22, 2006 6:11 am

Nagesh,

If there are any decimal type columns, then we can add an environmental variable 'APT_DECIMAL_INTERM_PRECISION' and set the value to 2. by default it is 38. by this we are limiting the internal calculations to use only 2 precision, which might speed up the job.

PerterPol,
It is enough to do hash partitioning in 'SORT' stage alone. and keep 'SAME' partitioning type in aggregator. no need to do hash partitioning again in aggregator.

Nageshsunkoji · Post by **Nageshsunkoji** » Wed Nov 22, 2006 7:29 am

tejaswini wrote:Nagesh,

If there are any decimal type columns, then we can add an environmental variable 'APT_DECIMAL_INTERM_PRECISION' and set the value to 2. by default it is 38. by this we are limiting the internal calculations to use only 2 precision, which might speed up the job.

PerterPol,
It is enough to do hash partitioning in 'SORT' stage alone. and keep 'SAME' partitioning type in aggregator. no need to do hash partitioning again in aggregator.

Hi Tejaswini,

I am sorry to say, your explanation is not giving proper information. The environmental variables APT_DECIMAL_INTERM_PRECISION and APT_DECIMAL_INTERM_SCALE are very helpful to you, if your data have the more than precision 38 and scale 10. This environmental variables are their to override the default values. If you reduce the Precision to 2, then your result is not proper and it is very wrong, though your are reducing the original vaules.

PeterPol · Post by **PeterPol** » Wed Nov 22, 2006 7:50 am

Sorting the data on the aggr key before aggregating performs no better...

Peter

ArndW · Post by **ArndW** » Wed Nov 22, 2006 7:54 am

Recall that there were 2 steps involved. The first was sorting.

ray.wurlod · Post by **ray.wurlod** » Wed Nov 22, 2006 8:04 am

Are the data pre-sorted when they are written to the Data Sets? If so, include an explicit Sort stage ahead of the Aggregator, and specifically make the sort rule for the key columns "Don't Sort (Previously Sorted)". This will prevent DataStage from inserting tsort operators.

View the score to see these inserted operators.

PeterPol · Post by **PeterPol** » Wed Nov 22, 2006 8:05 am

hi

I added a sort stage and adjusted the partition type to same in the aggregator. Isn't that enough?

Peter

thebird · Post by **thebird** » Wed Nov 22, 2006 8:47 am

PeterPol wrote:hi

I added a sort stage and adjusted the partition type to same in the aggregator. Isn't that enough?

Peter

Thats not enough. You have to specify in the Aggregator stage that the data has already been sorted.

Also set the Method property in the Aggregator stage to "Sort", by default it is Hash.

The Bird.