Aggregator stage performance

DSguru2B · Post by **DSguru2B** » Wed Nov 22, 2006 8:59 am

I think thats what you are missing, letting the aggregator know that the data is sorted on the specific fields. Just sorting wont help. You will have to inform the aggregator that it is sorted.

PeterPol · Post by **PeterPol** » Wed Nov 22, 2006 9:09 am

hi all

- The mode of operation was already set to sort..
- How/where do I specify that the data is already sorted. Cant find the option in the aggregator?

Peter

DSguru2B · Post by **DSguru2B** » Wed Nov 22, 2006 9:30 am

Excuse my ignorance. I keep thinking px aggregator works the same way as server. I just double checked and cannot find any place where you can specify the incoming data's sort order.

Nageshsunkoji · Post by **Nageshsunkoji** » Wed Nov 22, 2006 9:35 am

PeterPol wrote:hi all

- The mode of operation was already set to sort..
- How/where do I specify that the data is already sorted. Cant find the option in the aggregator?

Peter

Hi,

You can't find that option in the Aggregator. You can find the same in the sort stage to identify the sorted data.
If the Data is sorted on the same keys as your perofrming the aggregation in the source Dataset itself, as ray suggested already in his post. In the Sort stage select the option Don't sort if it is previously sorted. This option will restrict datastage to insert tsort operator. It will reduce the unnecessary sorting of already sorted data.

ray.wurlod · Post by **ray.wurlod** » Wed Nov 22, 2006 11:19 am

The place to find it is on the Partitioning tab on the input link. But the place to do it is in an upstream Sort stage, as a couple of posters have pointed out.

Nageshsunkoji · Post by **Nageshsunkoji** » Wed Nov 22, 2006 12:11 pm

Hi Peter,

Just try out this, it may helpful for you. Change the output datatype of every column in the aggregator to Double and in the down stream once again changed to decimal in any other operator.Its working finely for me.

tejaswini · Post by **tejaswini** » Wed Nov 22, 2006 11:02 pm

What is happening when we mention method = sort / hash in aggregator stage? How do they differ?

ram@puttur · Post by **ram@puttur** » Thu Nov 23, 2006 1:30 am

tejaswini wrote:What is happening when we mention method = sort / hash in aggregator stage? How do they differ?

Hi,

When using hash mode you should hash partition the input data by group keys.
However if you select sort mode, then aggregator stage expects the sorted input data (in the previous stage data has to be sorted).

thompsonp · Post by **thompsonp** » Thu Nov 23, 2006 2:15 am

As you don't appear to have isolated where the performance problem is (sort or aggregator), why don't you run a test.

Take your input dataset and hash partition and sort it on the grouping key. Write this to another dataset.

In a separate job read the sorted dataset into a sort stage that is set to don't sort already sorted and then into the aggregator (method = sort). Partitioning can be 'Same' for both stages. Write the output of the aggregator to another dataset or even just a Peek.

As you have a large number of calculations being performed and the incoming datatypes are not Double, I expect the performance bottleneck is down to the aggregator converting the input fields to doubles (and back to decimal again if you have selected this for the output).

I have run some tests on 3 different projects on version 7.1, 7.5 and 7.5x2 and all show a terrible performance drop when implicitly converting from decimal to double and back again.

I found a transformer to be quicker and a build-op to be even faster.
Of course both a transformer and a build-op will need the data sorted, whereas the aggregator could use a Hash method if you have enough memory to store the interim results.

Nageshsunkoji · Post by **Nageshsunkoji** » Thu Nov 23, 2006 5:07 am

Hi,

Few more points about Hash/ sort mode option in the Aggregator.

Hash Mode :

you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)

Sort Mode :

you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)

Your choice of mode depends primarily on the number of groupings
in the input data set, taking into account the amount of memory
available. You typically use hash mode for a relatively small number
of groups; generally, fewer than about 1000 groups per megabyte of
memory to be used.

DSXchange

Aggregator stage performance

method = sort/hash