Aggregator stage performance
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 222
- Joined: Tue Aug 30, 2005 2:07 am
- Location: pune
- Contact:
Hi,PeterPol wrote:hi all
- The mode of operation was already set to sort..
- How/where do I specify that the data is already sorted. Cant find the option in the aggregator?
Peter
You can't find that option in the Aggregator. You can find the same in the sort stage to identify the sorted data.
If the Data is sorted on the same keys as your perofrming the aggregation in the source Dataset itself, as ray suggested already in his post. In the Sort stage select the option Don't sort if it is previously sorted. This option will restrict datastage to insert tsort operator. It will reduce the unnecessary sorting of already sorted data.
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The place to find it is on the Partitioning tab on the input link. But the place to do it is in an upstream Sort stage, as a couple of posters have pointed out.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 222
- Joined: Tue Aug 30, 2005 2:07 am
- Location: pune
- Contact:
Hi Peter,
Just try out this, it may helpful for you. Change the output datatype of every column in the aggregator to Double and in the down stream once again changed to decimal in any other operator.Its working finely for me.
Just try out this, it may helpful for you. Change the output datatype of every column in the aggregator to Double and in the down stream once again changed to decimal in any other operator.Its working finely for me.
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
-
- Participant
- Posts: 1
- Joined: Tue Nov 21, 2006 3:33 am
method = sort/hash
tejaswini wrote:What is happening when we mention method = sort / hash in aggregator stage? How do they differ?
Hi,
When using hash mode you should hash partition the input data by group keys.
However if you select sort mode, then aggregator stage expects the sorted input data (in the previous stage data has to be sorted).
As you don't appear to have isolated where the performance problem is (sort or aggregator), why don't you run a test.
Take your input dataset and hash partition and sort it on the grouping key. Write this to another dataset.
In a separate job read the sorted dataset into a sort stage that is set to don't sort already sorted and then into the aggregator (method = sort). Partitioning can be 'Same' for both stages. Write the output of the aggregator to another dataset or even just a Peek.
As you have a large number of calculations being performed and the incoming datatypes are not Double, I expect the performance bottleneck is down to the aggregator converting the input fields to doubles (and back to decimal again if you have selected this for the output).
I have run some tests on 3 different projects on version 7.1, 7.5 and 7.5x2 and all show a terrible performance drop when implicitly converting from decimal to double and back again.
I found a transformer to be quicker and a build-op to be even faster.
Of course both a transformer and a build-op will need the data sorted, whereas the aggregator could use a Hash method if you have enough memory to store the interim results.
Take your input dataset and hash partition and sort it on the grouping key. Write this to another dataset.
In a separate job read the sorted dataset into a sort stage that is set to don't sort already sorted and then into the aggregator (method = sort). Partitioning can be 'Same' for both stages. Write the output of the aggregator to another dataset or even just a Peek.
As you have a large number of calculations being performed and the incoming datatypes are not Double, I expect the performance bottleneck is down to the aggregator converting the input fields to doubles (and back to decimal again if you have selected this for the output).
I have run some tests on 3 different projects on version 7.1, 7.5 and 7.5x2 and all show a terrible performance drop when implicitly converting from decimal to double and back again.
I found a transformer to be quicker and a build-op to be even faster.
Of course both a transformer and a build-op will need the data sorted, whereas the aggregator could use a Hash method if you have enough memory to store the interim results.
-
- Participant
- Posts: 222
- Joined: Tue Aug 30, 2005 2:07 am
- Location: pune
- Contact:
Hi,
Few more points about Hash/ sort mode option in the Aggregator.
Hash Mode :
you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)
Sort Mode :
you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)
Your choice of mode depends primarily on the number of groupings
in the input data set, taking into account the amount of memory
available. You typically use hash mode for a relatively small number
of groups; generally, fewer than about 1000 groups per megabyte of
memory to be used.
Few more points about Hash/ sort mode option in the Aggregator.
Hash Mode :
you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)
Sort Mode :
you should hash partition the input data set
by one or more of the grouping key columns so that all the records in
the same group are in the same partition (this happens automatically
if auto is set in the Partitioning tab)
Your choice of mode depends primarily on the number of groupings
in the input data set, taking into account the amount of memory
available. You typically use hash mode for a relatively small number
of groups; generally, fewer than about 1000 groups per megabyte of
memory to be used.
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
If you know anything SHARE it.............
If you Don't know anything LEARN it...............