Passing partly sorted data into the Aggregator

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
vnspn
Participant
Posts: 165
Joined: Mon Feb 12, 2007 11:42 am

Passing partly sorted data into the Aggregator

Post by vnspn »

Hi,

We have a clarification on the way the Aggregator works.

We have to aggregate incoming records on 5 columns. Out of these 5 columns, the incoming records are already
sorted on 2 of the columns. So, can we assert on the Input tab of the aggregator that the incoming data is sorted on 2 of the columns, although I have 5 columns selected on the Output tab to group by on that.

Can the aggregator undertand that the data is already sorted on 2 of the columns and do the aggregation based on
that. Does it still accumulate all the incoming records and only then process it?

Thanks.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

It will still need to sort on 3 of the 5 keys, but partially sorted data will sort much quicker than nonsorted data.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not the case, Arnd.

The server Aggregator stage builds a dynamic table in memory, with one row for each combination of grouping keys. The initial size and increment of this table is tuneable using the final option on DS.TOOLS menu.

With unsorted data, the entire input stream must be loaded into this in-memory table.

With data sorted on some or all of the grouping keys, a change in grouping key value(s) can identify rows in the table that can be pushed onto the output link, and the memory they occupied freed. This is why sorted data are so valuable; if data are sorted on all grouping keys, the in-memory table requires only one row. In turn, this makes it impossible that it will take an access violation (SIGSEGV) error.

Data sorted on a subset of the grouping columns is a better situation than completely unsorted data, especially if the data are sorted on the early, rather than late, grouping columns.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's good to know. I've been under the impression up until now that partially sorted data was as worthwhile as unsorted data - kind of an 'all or nothing' proposition to gain any value from it. Not sure why, never ran any tests one way or the other, just assumed that I guess.

So it can still take advantage of the pre-sort on a subset of the grouping keys provided one asserts that in the stage? And you seem to be saying that even if that sorted subset are late grouping columns, it still helps? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Ray - I understand your explanation and that is how I see the aggregation stage working. But I'm not sure which of the two assertions I made isn't correct - that a partially sorted table is faster than a nonsorted table or that the aggregator stage still needs to sort on those extra columns?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The memory table isn't sorted at all, whether or not the incoming data are sorted. That is, no sorting is applied. Indeed, what's built in memory is a hash table (not the same thing as a hashed file).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Ahh, I see what you meant.
Post Reply