Passing partly sorted data into the Aggregator

vnspn · Post by **vnspn** » Tue Jun 12, 2007 4:02 pm

Hi,

We have a clarification on the way the Aggregator works.

We have to aggregate incoming records on 5 columns. Out of these 5 columns, the incoming records are already
sorted on 2 of the columns. So, can we assert on the Input tab of the aggregator that the incoming data is sorted on 2 of the columns, although I have 5 columns selected on the Output tab to group by on that.

Can the aggregator undertand that the data is already sorted on 2 of the columns and do the aggregation based on
that. Does it still accumulate all the incoming records and only then process it?

Thanks.

ArndW · Post by **ArndW** » Tue Jun 12, 2007 4:38 pm

It will still need to sort on 3 of the 5 keys, but partially sorted data will sort much quicker than nonsorted data.

ray.wurlod · Post by **ray.wurlod** » Tue Jun 12, 2007 7:38 pm

Not the case, Arnd.

The server Aggregator stage builds a dynamic table in memory, with one row for each combination of grouping keys. The initial size and increment of this table is tuneable using the final option on DS.TOOLS menu.

With unsorted data, the entire input stream must be loaded into this in-memory table.

With data sorted on some or all of the grouping keys, a change in grouping key value(s) can identify rows in the table that can be pushed onto the output link, and the memory they occupied freed. This is why sorted data are so valuable; if data are sorted on all grouping keys, the in-memory table requires only one row. In turn, this makes it impossible that it will take an access violation (SIGSEGV) error.

Data sorted on a subset of the grouping columns is a better situation than completely unsorted data, especially if the data are sorted on the early, rather than late, grouping columns.

chulett · Post by **chulett** » Tue Jun 12, 2007 11:56 pm

That's good to know. I've been under the impression up until now that partially sorted data was as worthwhile as unsorted data - kind of an 'all or nothing' proposition to gain any value from it. Not sure why, never ran any tests one way or the other, just assumed that I guess.

So it can still take advantage of the pre-sort on a subset of the grouping keys provided one asserts that in the stage? And you seem to be saying that even if that sorted subset are late grouping columns, it still helps?

ArndW · Post by **ArndW** » Wed Jun 13, 2007 12:19 am

Ray - I understand your explanation and that is how I see the aggregation stage working. But I'm not sure which of the two assertions I made isn't correct - that a partially sorted table is faster than a nonsorted table or that the aggregator stage still needs to sort on those extra columns?

ray.wurlod · Post by **ray.wurlod** » Wed Jun 13, 2007 2:12 am

The memory table isn't sorted at all, whether or not the incoming data are sorted. That is, no sorting is applied. Indeed, what's built in memory is a hash table (not the same thing as a hashed file).

ArndW · Post by **ArndW** » Wed Jun 13, 2007 2:28 am

Ahh, I see what you meant.