
Why won't this fatal abort a job?

Posted: Sun Nov 06, 2005 8:52 pm
by chenxs
Dear all,

I got this fatal error in a parallel job, but the job ended in Finished status as if nothing went wrong... it did not abort...

aggregate_1,0: Caught exception from runLocally(): APT_BadAlloc: Heap allocation failed..

Why doesn't this fatal abort the job?

many thanks...

Posted: Mon Nov 07, 2005 2:18 am
by ray.wurlod
"Heap allocation failed" means that it couldn't get more memory when it demanded it. So it has spilled to disk, and continued to process, albeit not as fast.


Posted: Mon Nov 07, 2005 3:18 am
by chenxs
thanks, ray

but I want this fatal to abort the job, not let it finish...

how can I do that?

Posted: Mon Nov 07, 2005 3:50 am
by ArndW
You cannot do that. It is an internal message from the PX engine. It really shouldn't be an error message, but just a warning message.

Why do you want this to abort your job? As Ray has explained, it means that there is no more virtual memory space to hold the data in your aggregation (might you be able to raise your ulimit values?) so PX has begun using disk space to hold temporary data.

You can use PX message handling to demote this message to a warning, or you could pre-sort your data coming into the aggregator stage so that no temporary space is required.

Posted: Mon Nov 07, 2005 4:20 am
by gbusson
hello,

Open a case with Ascential.

There are some well-known bugs where fatal errors do not abort a job.

thx

Posted: Mon Nov 07, 2005 5:46 am
by chenxs
If this warning won't abort the job, the batch will continue...

but the result of this job is wrong (some data is missing)... so we need to abort this job...

and I have already pre-sorted the data...

Posted: Mon Nov 07, 2005 5:48 am
by chenxs
Maybe I need to write an after-job routine to abort the job when this fatal appears...
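
A rough sketch of that idea (not from this thread): have the script that runs the job check the log afterwards with the dsjob command-line client and fail the batch itself. The project and job names below are placeholders, and the exact dsjob options can vary by release.

Code:

PROJECT=MyProject
JOB=MyParallelJob

# Pull the fatal entries from the last run and look for the heap message.
# Assumes dsjob is on the PATH and supports "-logsum -type FATAL".
if dsjob -logsum -type FATAL $PROJECT $JOB 2>/dev/null \
     | grep -q "APT_BadAlloc: Heap allocation failed"
then
    echo "Heap allocation failure logged - treating the run as failed" >&2
    exit 1
fi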

Posted: Mon Nov 07, 2005 5:52 am
by ArndW
Ray and I have been saying that this error message seems to be a non-fatal one - meaning that your data is going to be the same regardless of whether this message is in the log or not.

Your incoming data needs to be sorted on the key columns of your aggregation. It is evidently not sorted that way, because if it were, the Aggregator stage would have no need for interim storage and you wouldn't be getting this error message. If you check the sort settings in your job, you can make this whole issue moot.

Posted: Mon Nov 07, 2005 6:39 am
by ray.wurlod
You are not losing any data when this message appears. All that is happening is that DataStage is being forced to use disk because there is not enough virtual memory available.

I agree with Arnd that it should not be a Fatal message.

Do check your ulimit settings for the user ID that runs DataStage jobs; you may be able to increase the amount of memory that user can allocate. Your UNIX administrator will need to be involved, because only the superuser can increase a ulimit.
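
A rough sketch of checking those limits (not from this thread; "dsadm" is just a placeholder for whatever user runs the DataStage engine):

Code:

# Show all limits currently in effect for the DataStage user.
su - dsadm -c "ulimit -a"

# Hard versus soft limits; a soft limit can be raised up to the hard limit
# without root, but raising a hard limit needs the superuser.
ulimit -H -a
ulimit -S -a
ulimit -d unlimited   # raise the data-segment soft limit, if the hard limit allows it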

Posted: Mon Nov 07, 2005 12:18 pm
by Ultramundane
You can set ulimit for a shell, and inherited shells will have the new ulimit. That is, you could set it in your profile, log out and log back in, then bounce DataStage.

Code:

cat ~/.profile
## DATASTAGE PROFILE
unset ENV

## DATA
ulimit -d 1048576

## MEMORY
ulimit -m 1048576

## NOFILES
ulimit -n 10000

## STACK
ulimit -s 262144

## CORE DUMP SIZE
ulimit -c 4194304

. ~/.dsadm

if [ -s "$MAIL" ]           # This is at Shell startup.  In normal
then echo "$MAILMSG"        # operation, the Shell checks
fi                          # periodically.

## DISPLAY SOME HELP INFO ON LOGIN
. ~/.menu

If osh is running into memory allocation problems (for example, the Lookup stage failing when you use more than 512 MB on the reference link), you can also change osh to use the large memory model.

On AIX, you would enter the following to change from 512 MB to 2GB.

Code:

/usr/ccs/bin/ldedit -bmaxdata:0x80000000/dsa $APT_ORCHHOME/bin/osh
/usr/ccs/bin/ldedit:  File /Ascential/DataStage/PXEngine/bin/osh updated.

Posted: Mon Nov 07, 2005 7:45 pm
by chenxs
thanks all~

I got the heap allocation failure when /tmp ran out of space. The Aggregator stopped processing (so some data was missing) and then the job finished immediately.

Anyway, your suggestions are very useful to me, thanks.
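
Since /tmp filling up was the trigger here, one related point: the temporary data that spills to disk goes to the scratchdisk resources named in the APT configuration file (and, for some operators, TMPDIR), so pointing scratchdisk at a larger filesystem than /tmp is one way to avoid this. A minimal single-node sketch, with the hostname and paths as placeholders:

Code:

# Hypothetical one-node APT configuration whose scratchdisk avoids /tmp.
cat > /bigfs/ds/config/one_node.apt <<'EOF'
{
    node "node1"
    {
        fastname "yourhost"
        pools ""
        resource disk "/bigfs/ds/datasets" {pools ""}
        resource scratchdisk "/bigfs/ds/scratch" {pools ""}
    }
}
EOF

# Tell the jobs to use it.
export APT_CONFIG_FILE=/bigfs/ds/config/one_node.apt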