Requirements for Aggregator Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Inquisitive
Charter Member
Posts: 88
Joined: Tue Jan 13, 2004 3:07 pm

Requirements for Aggregator Stage

Post by Inquisitive »

Hi,

Is it mandatory to sort the incoming data when we use the Aggregator stage?

I have a scenario where I am reading data from a Dataset, passing it to an Aggregator, and then writing it into a Dataset. The source Dataset is hash partitioned on the GROUP BY keys, in the same order.

I am processing 1 million records with 6 GROUP BY keys and a sum operation on 6 other fields. Without the sort option it takes 5 minutes; with a Sort stage before the Aggregator it takes 12 minutes.
I am trying to identify why it takes so much time to process this data.
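
For illustration, this is roughly the operation being described, as a minimal pure-Python sketch (the key and measure field names are made up; this is not DataStage code):

from collections import defaultdict

GROUP_KEYS = ("k1", "k2", "k3", "k4", "k5", "k6")  # hypothetical: the 6 GROUP BY keys
SUM_FIELDS = ("m1", "m2", "m3", "m4", "m5", "m6")  # hypothetical: the 6 summed fields

def aggregate(rows):
    # One running total per distinct combination of the group keys.
    totals = defaultdict(lambda: [0.0] * len(SUM_FIELDS))
    for row in rows:
        key = tuple(row[k] for k in GROUP_KEYS)
        for i, field in enumerate(SUM_FIELDS):
            totals[key][i] += row[field]
    return totals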

When I checked the log file I saw the job waiting for a long time at the last entry shown below before completing.
(This happens only when I try to sort the data.)

Event: aggDateSales,0: Hash table has grown to 16384 entries.
Event: aggDateSales,1: Hash table has grown to 32768 entries.
Event: aggDateSales,0: Hash table has grown to 32768 entries.
Event: aggDateSales,0: Hash table has grown to 65536 entries.
Event: aggDateSales,1: Hash table has grown to 65536 entries.
Event: aggDateSales,0: Hash table has grown to 131072 entries.
Event: aggDateSales,1: Hash table has grown to 131072 entries.
Event: aggDateSales,0: Hash table has grown to 262144 entries.
Event: aggDateSales,1: Hash table has grown to 262144 entries.
Event: aggDateSales,0: Hash table has grown to 524288 entries.
Event: aggDateSales,1: Hash table has grown to 524288 entries.


Could it be because of a space issue?

And where is data temporarily stored when the Aggregator stage cannot hold all of it in memory? Is it on the resource disk, as specified in the config file?

Note: Sorry for asking so many questions in one message. I am not trying to make it complicated; I am trying to give all the information and tie it together, to understand exactly how the Aggregator works and how I can tune this job.

Thanks
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

So many questions, maybe you should number them? :lol:

1. No, not mandatory.
2. No, the hash table is the temporary store holding the summarized groups. The more distinct groups, the more space, and thus the longer the runtime, because the number of rows being stored is greater.
3. You've answered your own next question.
4. Yes, that's why you allocate a bunch of scratch disk space. It gives the illusion of keeping data in memory, but it's actually scratching and swapping.

By sorting, the hash table stays small because it only keeps the current group in memory, as opposed to holding every group in memory until the end of the input. You need to weigh the benefit of sorting against just grinding through the data.
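
To make the contrast concrete, here is a minimal Python sketch of the two methods (an illustration, not DataStage internals): hash mode holds one entry per distinct group until the input is exhausted, while sort mode only ever holds the running total for the current group.

from itertools import groupby

def hash_aggregate(rows, key, val):
    # Hash method: the table holds every distinct group seen so far,
    # and nothing can be emitted until all input has been read.
    totals = {}
    for row in rows:
        totals[key(row)] = totals.get(key(row), 0) + val(row)
    return totals

def sort_aggregate(sorted_rows, key, val):
    # Sort method: input is pre-sorted on the group keys, so each group
    # is complete as soon as the key changes; memory stays constant.
    for k, group in groupby(sorted_rows, key=key):
        yield k, sum(val(row) for row in group)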

Imagine if all of the rows summarized into a single result row. Sorting would have no benefit and would just waste a lot of time. Likewise, if every input row were unique, sorting would waste a lot of time, but in exchange the Aggregator would not have to build a huge hash table holding every row. You need to profile your data and decide which method makes sense for the "average" run, and how much performance degrades when a run deviates from that. Choose the method that consistently gives the best performance across the range of your data profiling characteristics.
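
To put rough costs on those two extremes: sorting is an O(N log N) up-front pass that buys constant aggregation memory, while hash mode skips the sort but needs memory proportional to the number of distinct groups. With 1,000,000 rows collapsing to one group, that is a one-entry table and a wasted sort; with 1,000,000 unique keys, it is a million-entry table unless the input arrives sorted.
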
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Inquisitive
Charter Member
Posts: 88
Joined: Tue Jan 13, 2004 3:07 pm

Post by Inquisitive »

Thanks Kenneth,

That is a very clear explanation.

I am seeing 1 million records in the source grouping down to 700,000 records in the target.

So it seems the final decision point is which is faster: sorting the data, or letting the Aggregator build those hash tables.

In my case I found that building the hash table is faster.
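
Those numbers also line up with the log above: 700,000 groups over two partitions is roughly 350,000 hash-table entries each, so a table that doubles as it fills would grow past 262144 to 524288 slots, which is exactly what the messages show (assuming the table resizes by doubling). One thing worth checking: the parallel Aggregator has a Method property (hash or sort); if it is left on hash, a Sort stage in front only adds the sort cost while the stage still builds the full table, which would be consistent with the growth messages appearing on the sorted run.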

So I assume we don't have to sort the data even in the following scenario:

- Reading data from a sequential file which is not sorted, and running the job on 4 nodes.

In the above scenario it builds up all the data in those hash tables, right?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Only until you exhaust the available memory for the hash table; then it must spill to disk, which is quite a slowdown.
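
As an illustration of that spill behaviour, here is a sketch with a made-up memory cap (not DataStage internals; the real stage spills to the scratch disk areas named in the config file): once the in-memory table hits its limit, partial totals are flushed to scratch files and merged at the end.

import os, pickle, tempfile
from collections import defaultdict

MAX_GROUPS_IN_MEMORY = 100_000  # made-up cap standing in for available memory

def aggregate_with_spill(rows, key, val):
    totals, spill_files = defaultdict(float), []
    for row in rows:
        totals[key(row)] += val(row)
        if len(totals) >= MAX_GROUPS_IN_MEMORY:
            # Spill the partial totals to a scratch file and start over.
            f = tempfile.NamedTemporaryFile(delete=False)
            pickle.dump(dict(totals), f)
            f.close()
            spill_files.append(f.name)
            totals.clear()
    # Merge the in-memory remainder with every spilled partial result.
    for name in spill_files:
        with open(name, "rb") as f:
            for k, v in pickle.load(f).items():
                totals[k] += v
        os.remove(name)
    return totals

Each spill trades memory for the extra disk traffic and the final merge pass, which is the slowdown being described.
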
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.