enormous aggregator

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

Hi,
we're trying to process internet-usage records, about 60 million per day. These records need to be aggregated.
However, the output from the aggregator is still very large (40 million records).

We have some memory and swap (32 GB memory, 32 GB swap), but this doesn't seem to be enough. Our UNIX admin does not want to expand this.

Are there any parameters that can decrease the memory usage? If this decreases performance, we can live with it.

Are there other ways to decrease memory usage, such as inserting a Filter stage that feeds two smaller aggregators, or would the combined memory use of those be the same?

For now my only option would be to run this job multiple times, but this also leads to longer runtimes: because of the other queries involved, running 3 times on 1/3 of the records each takes about 3 times as long.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Jasper,

The reason the aggregator is using so much memory is that it is doing its own sorting prior to aggregating. If you sort the incoming data using your UNIX sort (or CoSort/SyncSort if you have them) and then tell the Aggregator stage that it is working with sorted data, you will see memory and temp disk usage go down, and there is a good chance that your performance will go up as well.
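
To illustrate why pre-sorted input needs so little memory, here is a minimal Python sketch (not DataStage code). It assumes hypothetical (subscriber, bytes) rows that are already sorted by key; the function name and sample data are invented for illustration. Because all rows for a key are adjacent, only one running total is held at a time.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(records):
    """Sum the second field per key, assuming records are sorted by key.

    All rows for one key are adjacent in sorted input, so each group can be
    totalled and emitted before the next one starts.
    """
    for key, group in groupby(records, key=itemgetter(0)):
        yield key, sum(nbytes for _, nbytes in group)


# Hypothetical sample rows, pre-sorted by subscriber.
rows = [("alice", 100), ("alice", 250), ("bob", 40), ("carol", 7), ("carol", 3)]
totals = list(aggregate_sorted(rows))
print(totals)
```

If the input were not sorted, the same key could reappear later in the stream, and an aggregator would have to keep a total for every key seen so far, which is where the memory goes with tens of millions of output groups.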
jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

Thanks for the hint, this does indeed decrease memory usage enormously.

However:
I've never worked with the UNIX sort before, and I seem to be having problems with the available data types. When I insert a UNIX sort, it gives errors saying that timestamp and decimal are not allowed as keys.
So this option means an extra Transformer (or Modify) before and after the sort/aggregator.
Are there other ways?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Jasper,
Yes, it seems the only acceptable data types are int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat and string.
You could try changing the data type of the timestamp to varchar and then sorting.

Hi Arnd,
Is the DataStage sort utility not that efficient for huge data? May I know what the difference is between the UNIX sort and DataStage?

-Kumar
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kumar_s wrote: Is the DataStage sort utility not that efficient for huge data? May I know what the difference is between the UNIX sort and DataStage?
At least a 10x speed improvement for the UNIX sort, from what I've seen on my system, an HP 'Superdome'. And the UNIX sort can handle large volumes.

Disclaimer: my experience is with the Sort stage in Server jobs. FYI.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Always choose your tools according to the job at hand: the UNIX sort will work much faster at large volumes than either Server or PX sorts or aggregators can. Any data type sorted in DS can also be sorted the same way using the UNIX sort. The timestamp will be represented either as an integer value or as a string, both of which can be handled with the sort command.
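
As a rough illustration of the integer representation, the Python sketch below converts timestamps to whole seconds since the epoch and sorts on that number. The 'YYYY-MM-DD HH:MM:SS' layout, the UTC assumption and the helper name are all assumptions made for the example, not something taken from the job in question.

```python
from datetime import datetime, timezone


def to_epoch_seconds(ts):
    """Convert a 'YYYY-MM-DD HH:MM:SS' string (assumed UTC) to integer seconds."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# Hypothetical sample timestamps in arbitrary order.
stamps = ["2005-11-03 14:00:00", "2005-11-02 09:30:15", "2005-11-03 08:05:00"]

# Sorting on the integer key gives chronological order.
by_epoch = sorted(stamps, key=to_epoch_seconds)
print(by_epoch)
```

An integer key like this fits the key types the sort accepts, at the cost of a conversion before and after the sort.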
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Arnd/Craig,

Based on your previous experience, can you give a range (max size / number of lines) at which the DataStage utility starts to slow down?
Or, in other words, the advisable cut-off beyond which the UNIX sort should be used instead.

Hi Arnd,

Surprisingly, a timestamp can be sorted through the DataStage utility but not through the UNIX sort.

-Kumar
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I don't think there's a number you can give that wouldn't vary from system to system. It's more of a threshold of pain: if your sort takes 'too long', try it from the command line.

And what makes you think you can't sort a timestamp from UNIX? An ISO-formatted timestamp is perfectly sortable, even as a string.
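
A quick way to convince yourself of the string claim is the small Python check below. It assumes fixed-width, zero-padded timestamps in 'YYYY-MM-DD HH:MM:SS' form (the sample values are made up) and compares a plain string sort against a sort on the parsed date-time values.

```python
from datetime import datetime

# Hypothetical ISO-style timestamps, zero-padded and all the same width.
stamps = [
    "2005-11-03 14:00:00",
    "2005-01-15 23:59:59",
    "2004-12-31 00:00:01",
    "2005-11-03 08:05:00",
]

# Plain lexicographic sort, which is what a string key gives you.
as_strings = sorted(stamps)

# Chronological sort on the parsed values, for comparison.
as_datetimes = sorted(
    stamps, key=lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
)

print(as_strings == as_datetimes)
```

The two orders agree because the fields run from most to least significant and are zero-padded; a format such as DD/MM/YYYY would not sort correctly as a string.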
-craig

"You can never have too many knives" -- Logan Nine Fingers