Aggregator stage performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Aggregator stage performance

Post by PeterPol »

hi all,

my datastage job aggregates about 450000 records into 210 groups with 300 summed output columns.

The outcome of this aggregation is fine but performance is very poor. Running the job takes 45 minutes to finish. I tried to sort the data on the grouping column before aggregating, but without succes. The jobtime was even minutes longer. In an older version of this job I used a SQL group by_statement within sybase OC stage to do the aggregation. The performance then was much better (few minutes) than my current job. Since I now work with datasets as input, the former sybase solution cannot be used.

Is there a way to speed up the aggregation?


Thanks in advance.

Peter
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If you can sort your incoming data stream to the aggregator on the columns you use in that stage and then specify in the aggregator that the columns are sorted you will see that the stage performs blazingly fast. This is because it no longer needs to store data temporarily, it just sums/aggregates values and when a group change occurs it outputs the result.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

Hi,

Adding few more points to the Arnd solution, perofrm HASH partition in the Sort stage before aggregator stage on grouping keys that you are using in the Aggregator stage. Your data will sort properly based on group keys and your job performance will increase.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

among the columns which you are aggregating, are there any decimal columns?
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

tejaswini wrote:among the columns which you are aggregating, are there any decimal columns?
Hi tejaswini,

May I know, how it make difference with the performance if we have the decimal fields in the aggregation key. Can you explain the same, please.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Post by PeterPol »

It doesn't seem to help (still 48 minutes), or may be I misunderstood your advise...

I added a sort stage before the aggregator stage to sort input on the aggregator grouping key. In both the sort and aggregator stage I selected hash partitioning.

Peter
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

PeterPol wrote:It doesn't seem to help (still 48 minutes), or may be I misunderstood your advise...

I added a sort stage before the aggregator stage to sort input on the aggregator grouping key. In both the sort and aggregator stage I selected hash partitioning.

Peter
Hi Peter,

Its not required to perform HASH partition in both Sort stage and Aggregator stage. Just perform HASH partition in the sort stage and then select partitioning type as 'SAME' in the aggregator stage. Just try out this and let us know u r statistics.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Post by PeterPol »

The aggregation key is a varchar(11). The 300 output columns in the aggregator stage are integers, and decimals (most with precision: 15,3)

Peter
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

Nagesh,

If there are any decimal type columns, then we can add an environmental variable 'APT_DECIMAL_INTERM_PRECISION' and set the value to 2. by default it is 38. by this we are limiting the internal calculations to use only 2 precision, which might speed up the job.

PerterPol,
It is enough to do hash partitioning in 'SORT' stage alone. and keep 'SAME' partitioning type in aggregator. no need to do hash partitioning again in aggregator.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

tejaswini wrote:Nagesh,

If there are any decimal type columns, then we can add an environmental variable 'APT_DECIMAL_INTERM_PRECISION' and set the value to 2. by default it is 38. by this we are limiting the internal calculations to use only 2 precision, which might speed up the job.

PerterPol,
It is enough to do hash partitioning in 'SORT' stage alone. and keep 'SAME' partitioning type in aggregator. no need to do hash partitioning again in aggregator.
Hi Tejaswini,

I am sorry to say, your explanation is not giving proper information. The environmental variables APT_DECIMAL_INTERM_PRECISION and APT_DECIMAL_INTERM_SCALE are very helpful to you, if your data have the more than precision 38 and scale 10. This environmental variables are their to override the default values. If you reduce the Precision to 2, then your result is not proper and it is very wrong, though your are reducing the original vaules.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Post by PeterPol »

Sorting the data on the aggr key before aggregating performs no better...

Peter
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Recall that there were 2 steps involved. The first was sorting.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are the data pre-sorted when they are written to the Data Sets? If so, include an explicit Sort stage ahead of the Aggregator, and specifically make the sort rule for the key columns "Don't Sort (Previously Sorted)". This will prevent DataStage from inserting tsort operators.

View the score to see these inserted operators.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Post by PeterPol »

hi

I added a sort stage and adjusted the partition type to same in the aggregator. Isn't that enough?

Peter
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India
Contact:

Post by thebird »

PeterPol wrote:hi

I added a sort stage and adjusted the partition type to same in the aggregator. Isn't that enough?

Peter
Thats not enough. You have to specify in the Aggregator stage that the data has already been sorted.

Also set the Method property in the Aggregator stage to "Sort", by default it is Hash.

The Bird.
Post Reply