Hi,
We have a clarification on the way the Aggregator works.
We have to aggregate incoming records on 5 columns. Out of these 5 columns, the incoming records are already
sorted on 2 of the columns. So, can we assert on the Input tab of the aggregator that the incoming data is sorted on 2 of the columns, although I have 5 columns selected on the Output tab to group by on that.
Can the aggregator undertand that the data is already sorted on 2 of the columns and do the aggregation based on
that. Does it still accumulate all the incoming records and only then process it?
Thanks.
Passing partly sorted data into the Aggregator
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Not the case, Arnd.
The server Aggregator stage builds a dynamic table in memory, with one row for each combination of grouping keys. The initial size and increment of this table is tuneable using the final option on DS.TOOLS menu.
With unsorted data, the entire input stream must be loaded into this in-memory table.
With data sorted on some or all of the grouping keys, a change in grouping key value(s) can identify rows in the table that can be pushed onto the output link, and the memory they occupied freed. This is why sorted data are so valuable; if data are sorted on all grouping keys, the in-memory table requires only one row. In turn, this makes it impossible that it will take an access violation (SIGSEGV) error.
Data sorted on a subset of the grouping columns is a better situation than completely unsorted data, especially if the data are sorted on the early, rather than late, grouping columns.
The server Aggregator stage builds a dynamic table in memory, with one row for each combination of grouping keys. The initial size and increment of this table is tuneable using the final option on DS.TOOLS menu.
With unsorted data, the entire input stream must be loaded into this in-memory table.
With data sorted on some or all of the grouping keys, a change in grouping key value(s) can identify rows in the table that can be pushed onto the output link, and the memory they occupied freed. This is why sorted data are so valuable; if data are sorted on all grouping keys, the in-memory table requires only one row. In turn, this makes it impossible that it will take an access violation (SIGSEGV) error.
Data sorted on a subset of the grouping columns is a better situation than completely unsorted data, especially if the data are sorted on the early, rather than late, grouping columns.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
That's good to know. I've been under the impression up until now that partially sorted data was as worthwhile as unsorted data - kind of an 'all or nothing' proposition to gain any value from it. Not sure why, never ran any tests one way or the other, just assumed that I guess.
So it can still take advantage of the pre-sort on a subset of the grouping keys provided one asserts that in the stage? And you seem to be saying that even if that sorted subset are late grouping columns, it still helps?![Confused :?](./images/smilies/icon_confused.gif)
So it can still take advantage of the pre-sort on a subset of the grouping keys provided one asserts that in the stage? And you seem to be saying that even if that sorted subset are late grouping columns, it still helps?
![Confused :?](./images/smilies/icon_confused.gif)
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The memory table isn't sorted at all, whether or not the incoming data are sorted. That is, no sorting is applied. Indeed, what's built in memory is a hash table (not the same thing as a hashed file).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.