Aggregator stage performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I think thats what you are missing, letting the aggregator know that the data is sorted on the specific fields. Just sorting wont help. You will have to inform the aggregator that it is sorted.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Post by PeterPol »

hi all

- The mode of operation was already set to sort..
- How/where do I specify that the data is already sorted. Cant find the option in the aggregator?

Peter
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Excuse my ignorance. I keep thinking px aggregator works the same way as server. I just double checked and cannot find any place where you can specify the incoming data's sort order.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

PeterPol wrote:hi all

- The mode of operation was already set to sort..
- How/where do I specify that the data is already sorted. Cant find the option in the aggregator?

Peter
Hi,

You can't find that option in the Aggregator. You can find the same in the sort stage to identify the sorted data.
If the Data is sorted on the same keys as your perofrming the aggregation in the source Dataset itself, as ray suggested already in his post. In the Sort stage select the option Don't sort if it is previously sorted. This option will restrict datastage to insert tsort operator. It will reduce the unnecessary sorting of already sorted data.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The place to find it is on the Partitioning tab on the input link. But the place to do it is in an upstream Sort stage, as a couple of posters have pointed out.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

Hi Peter,

Just try out this, it may helpful for you. Change the output datatype of every column in the aggregator to Double and in the down stream once again changed to decimal in any other operator.Its working finely for me.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

What is happening when we mention method = sort / hash in aggregator stage? How do they differ?
ram@puttur
Participant
Posts: 1
Joined: Tue Nov 21, 2006 3:33 am

method = sort/hash

Post by ram@puttur »

tejaswini wrote:What is happening when we mention method = sort / hash in aggregator stage? How do they differ?

Hi,

When using hash mode you should hash partition the input data by group keys.
However if you select sort mode, then aggregator stage expects the sorted input data (in the previous stage data has to be sorted).

thompsonp
Premium Member
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

As you don't appear to have isolated where the performance problem is (sort or aggregator), why don't you run a test.

Take your input dataset and hash partition and sort it on the grouping key. Write this to another dataset.

In a separate job read the sorted dataset into a sort stage that is set to don't sort already sorted and then into the aggregator (method = sort). Partitioning can be 'Same' for both stages. Write the output of the aggregator to another dataset or even just a Peek.

As you have a large number of calculations being performed and the incoming datatypes are not Double, I expect the performance bottleneck is down to the aggregator converting the input fields to doubles (and back to decimal again if you have selected this for the output).

I have run some tests on 3 different projects on version 7.1, 7.5 and 7.5x2 and all show a terrible performance drop when implicitly converting from decimal to double and back again.

I found a transformer to be quicker and a build-op to be even faster.
Of course both a transformer and a build-op will need the data sorted, whereas the aggregator could use a Hash method if you have enough memory to store the interim results.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

Hi,

Few more points about Hash/ sort mode option in the Aggregator.

Hash Mode :

you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)

Sort Mode :

you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)

Your choice of mode depends primarily on the number of groupings
in the input data set, taking into account the amount of memory
available. You typically use hash mode for a relatively small number
of groups; generally, fewer than about 1000 groups per megabyte of
memory to be used.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
Post Reply