Hi,
we're trying to process internet-usage records, about 60 million records per day, and these records need to be aggregated.
However, the output from the Aggregator is still very large (about 40 million rows).
We have some memory and swap (32 GB memory, 32 GB swap), but this doesn't seem to be enough, and our UNIX admin does not want to expand it.
Are there any parameters that can decrease the memory usage? (If this decreases performance, we can live with that.)
Are there other ways to decrease memory usage, such as inserting a filter stage so that we end up with two smaller aggregators, or would the sum of those use just as much?
For now my only option is to run this job multiple times, but that also means longer runtimes: because of the other queries involved, running three times over 1/3 of the records each takes about three times as long.
enormous aggregator
Jasper,
the reason the Aggregator is using so much memory is that it does its own sort prior to aggregating. If you sort the incoming data with the UNIX sort (or CoSort/SyncSort if you have them) and then tell the Aggregator stage that it is working with sorted data, you will see memory and temp disk usage go down, and there is a good chance your performance will go up as well.
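The pre-sort Arnd describes can be sketched from the command line. This is a minimal, hypothetical example; the comma-delimited layout and the key columns are assumptions for illustration, not the poster's actual record format:

```shell
# Sample comma-delimited usage records: user, day, bytes.
# The field layout here is an assumption for illustration.
cat > usage.csv <<'EOF'
bob,2006-01-02,300
alice,2006-01-01,100
bob,2006-01-01,200
EOF

# Pre-sort on the aggregation keys (user, then day) so all rows of a
# group arrive contiguously; the Aggregator stage can then be told the
# input is sorted and avoid buffering the whole dataset in memory.
sort -t',' -k1,1 -k2,2 usage.csv > usage_sorted.csv

cat usage_sorted.csv
```

With the input sorted this way, each group is complete as soon as the key changes, which is what lets the Aggregator release memory as it goes.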
thanks for the hint, this indeed decreases memory usage enormously.
However:
I've never worked with the UNIX sort before, and I seem to be having problems with the available data types. When I insert a UNIX sort, it gives errors that timestamp and decimal are not allowed as keys.
So this option means an extra Transform (or Modify) stage before and after the sort/aggregator.
Are there other ways?
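For what it's worth, the UNIX sort itself has no trouble with decimal values once they reach it as text; a numeric key flag handles them. A small sketch (the values are made up for illustration):

```shell
# Decimal values sorted as plain text would put "9.5" after "10.25"
# (lexical order); -n sorts them by numeric value instead.
printf '10.25\n9.5\n2.125\n' | sort -n > sorted.txt
cat sorted.txt
```

So the extra Transform/Modify only needs to render the column as text; the sort itself does not care that the value was a decimal.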
Hi Jasper,
Yes, it seems the acceptable data types are only int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, and string.
You could try changing the data type of the timestamp to varchar and then do the sort.
Hi Arnd,
Is the DataStage sort not that efficient for huge data? May I know what the difference is between the UNIX sort and the DataStage sort?
-Kumar
kumar_s wrote: Is the DataStage sort not that efficient for huge data? May I know what the difference is between the UNIX sort and the DataStage sort?
At least a 10x speed improvement for the UNIX sort, from what I've seen on my system, an HP 'Superdome'. And the UNIX sort can handle large volumes.
Disclaimer: my experience is with the Sort stage in Server jobs. FYI.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Always choose your tools according to the job at hand: at large volumes the UNIX sort will work much faster than either the Server or PX sorts or aggregators. Any data type sorted in DataStage can also be sorted the same way with the UNIX sort. A timestamp will be either an integer value representation or a string one, and both can be handled with the sort command.
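Both timestamp representations Arnd mentions sort directly. A hypothetical sketch, with the timestamp assumed to be column 2 of a comma-delimited file (field positions are made up for illustration):

```shell
# The same key once as epoch seconds (sorted numerically with -n on
# the key) and once as an ISO-8601 string (sorted lexically).
printf 'b,1136160000\na,1136073600\n'                       | sort -t',' -k2,2n > by_epoch.txt
printf 'b,2006-01-02 00:00:00\na,2006-01-01 00:00:00\n'     | sort -t',' -k2     > by_iso.txt

cat by_epoch.txt
cat by_iso.txt
```

Note that with `-t','` the embedded space in the ISO string stays inside field 2, so no extra quoting is needed.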
Hi Arnd/Craig,
Based on your previous experience, can you give a range (max size / number of lines) at which the DataStage sort starts to slog?
Or, in other words, an advisable cut-off above which the UNIX sort is the more efficient choice?
Hi Arnd,
Surprisingly, a timestamp can be sorted with the DataStage sort but not with the UNIX sort.
-Kumar
I don't think there's a number anyone can give that wouldn't vary from system to system. It's more a threshold of pain: if your sort takes 'too long', try it from the command line.
And what makes you think you can't sort a timestamp from UNIX? An ISO-formatted timestamp is perfectly sortable, even as a string.
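Because an ISO-formatted timestamp puts the most significant component first in fixed-width fields, plain lexical order is already chronological order; no numeric conversion is needed:

```shell
# A plain (lexical) sort on fixed-format ISO timestamps yields
# chronological order directly.
printf '2006-01-02 10:00:00\n2005-12-31 23:59:59\n2006-01-02 09:00:00\n' \
  | sort > ts_sorted.txt
cat ts_sorted.txt
```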
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers