Abnormal termination of stage

Posted: Thu Jun 14, 2007 8:05 am
by bashbal
I can't figure out why some of my jobs are aborting. These jobs have been stable for months, but in the last week, we've suffered several unexplained errors. Here's an example:

Abnormal termination of stage LoadCustomerGroupDimension..XfrmCustomerGroup detected

After resetting the job in DS director (Just for you Ray :) ) I get the following "From previous run" messages:

From previous run
DataStage Job 256 Phantom 12616
jobnotify: Unknown error
DataStage Phantom Finished.
[12649] DSD.StageRun LoadCustomerGroupDimension. LoadCustomerGroupDimension.XfrmCustomerGroup 387 0/50 - terminated.
From previous run
DataStage Job 256 Phantom 12649
Abnormal termination of DataStage.
Fault type is 10. Layer type is BASIC run machine.
Fault occurred in BASIC program DSD.GetRTProp at address 258.


We have an approved project to upgrade to 7.5, but that is at least a couple of months away. Meanwhile, these failures are delaying a "management" critical report.

Posted: Thu Jun 14, 2007 3:10 pm
by ray.wurlod
As always, find out what's changed. It may not be something in DataStage - you may need to cast your net more widely. Were you running using a different login ID?

Or it could be a damaged or corrupted run-time hashed file in the Repository; the error message was generated from the (internal) routine DSD.GetRTProp (get run-time property). Execute the following commands in the project directory.

Code:

. $DSHOME/dsenv
$DSHOME/bin/fixtool RT_CONFIG256
$DSHOME/bin/fixtool RT_STATUS256
Also check that these are writeable by your DataStage user.
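
The permission check can be scripted. This is only a minimal sketch: it assumes the run-time files for job 256 sit in the project directory under the names used above, and check_writable is a made-up helper name, not a DataStage tool. Run it as the DataStage user from the project directory.

```shell
#!/bin/sh
# check_writable PATH...: report anything the current user cannot write to.
# Hypothetical helper for illustration; not part of DataStage.
check_writable() {
    rc=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING: $f"
            rc=1
            continue
        fi
        # Walk the path itself and, if it is a directory, everything below it.
        # grep . passes the findings through and succeeds only if there were any.
        find "$f" -print | while IFS= read -r p; do
            [ -w "$p" ] || echo "NOT WRITABLE: $p"
        done | grep . && rc=1
    done
    return $rc
}

# Example (from the project directory):
#   check_writable RT_CONFIG256 RT_STATUS256
```

Note that root can write to almost anything, so the check only means something when run as the ID that actually runs the jobs.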

Re: Abnormal termination of stage

Posted: Thu Jun 14, 2007 4:27 pm
by chulett
bashbal wrote: Fault type is 10. Layer type is BASIC run machine.
Fault occurred in BASIC program DSD.GetRTProp at address 258.
As noted, what changed? A fault type of 10 is a SIGBUS or 'bus error'. We get them on HP-UX when running Korn shell scripts if the $LD_PRELOAD environment variable is set in the dsenv file. And we got them in Oracle processing when we were using a 'bugged' version of the Oracle client with a nasty memory leak.

Any chance of either of those being the culprit?
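
Both points can be checked quickly from the shell. The sketch below is an illustration only: signal numbers map to different names on different platforms, so the first command simply asks the local system what it calls signal 10. has_ld_preload is a made-up helper name, and it assumes dsenv is a plain shell file in which commented-out lines start with #.

```shell
#!/bin/sh
# Print the name this platform gives to signal number 10.
# The number-to-name mapping is platform-specific, so check rather than assume.
kill -l 10

# has_ld_preload FILE: list uncommented lines that mention LD_PRELOAD.
# Succeeds (exit 0) only if at least one such line exists.
# Hypothetical helper for illustration.
has_ld_preload() {
    grep -n '^[^#]*LD_PRELOAD' "$1"
}

# Example:
#   has_ld_preload "$DSHOME/dsenv" && echo "LD_PRELOAD is set in dsenv"
```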

More info

Posted: Fri Jun 15, 2007 7:38 am
by bashbal
Ray,

1) The user id running has not changed. It is always dsadm.
2) Thanks for the fixtool example. However, no errors were found. I'm going to run it against all of the job files just to be safe.
3) All RT* files are writable by dsadm.

Craig,
1) My sysadmin agrees with you about the $LD_PRELOAD. He discovered problems with this and removed it from dsenv a long time ago (> 1yr).
2) We will look into the possible leaky Oracle client. We are running Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production

What has changed:
1) The only system level change was the rollout of Centrify active directory security to our server. However, this was a month ago and the dsadm, oracle, root & cognos user IDs are excluded from this process.
2) We did turn over some new jobs last week, but they run in a different job stream at a different time. We will check them for collateral damage, however.
3) The only other change is what changes every day...new data. It may be that we have been flirting with some limit for some time and recently loaded the "straw that broke the camel's back". Our sysadmin is analyzing system stats to see if we can find the "culprit".

One more thing: the jobs don't fail every day. The last week went like this: Thurs - crash, Fri - crash, Sat - ran okay, Sun - crash, Mon - crash, Tue - crash, Wed - ran okay, Thurs - crash, Fri (today) - ran okay.
No changes were made between these runs.

Here's our recovery procedure: reset the jobs and rerun them with no changes. This issue is very frustrating because we have not been able to consistently reproduce the errors.
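
For anyone wanting to script the same reset-and-rerun from the command line, here is a rough sketch using the dsjob client. The project name is a placeholder (it is not given in this thread), reset_and_rerun is a made-up function name, and the sketch assumes the reset call returns non-zero when it fails; check the exit codes on your own release before relying on it.

```shell
#!/bin/sh
# reset_and_rerun PROJECT JOB: reset an aborted job, then run it again.
# Hypothetical wrapper for illustration. DSJOB may be set to override the
# default location of the dsjob client under $DSHOME.
reset_and_rerun() {
    dsjob_cmd=${DSJOB:-$DSHOME/bin/dsjob}
    # An aborted job has to be reset before it can be run again.
    # Assumption: a failed reset gives a non-zero exit code.
    "$dsjob_cmd" -run -mode RESET -wait "$1" "$2" || return 1
    # -jobstatus waits for the job and derives the exit code from the job's
    # finishing status, so it is not a plain 0 / non-zero result.
    "$dsjob_cmd" -run -mode NORMAL -jobstatus "$1" "$2"
}

# Example (MyProject is a placeholder):
#   . $DSHOME/dsenv
#   reset_and_rerun MyProject LoadCustomerGroupDimension
```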

Posted: Fri Jun 15, 2007 7:46 am
by chulett
IBM Support identified the 9.2.0.5 client as one of the 'problematic' ones. However, it is not nearly as problematic as 9.2.0.1, which I was given 'accidentally' one day. Good Lord was that a mess until that little detail was discovered. :evil:

IBM suggested upgrading the client to 9.2.0.6, which we did and which resolved all occurrences of the issue for us. However, that's only applicable if the only jobs that crash with this SIGBUS error are using the OCI stages, and I don't believe you've clarified that one way or the other.

Oracle patch

Posted: Fri Jun 15, 2007 8:47 am
by bashbal
We just received docs from Oracle on the memory leak in 9.2.0.5. There is a workaround listed as well as the 9.2.0.7 patch. He's going to try those and see if they help.

To clarify, the job I listed uses two OCI 9i stages.

------------
I just ran DS.TOOLS->4. Check integrity of job files and received the following messages:
COMO DSR_CHECKER established 09:40:36 15 JUN 2007
"DS_JOBS.cleanup" already has a DATA definition record.
File name =
File not created.
Program "DS.CHECKER": Line 664, Improper data type.
:o

I must admit that I'm not that familiar with this command and only ran it because, based on its title, it seemed like a good idea to try. I'm not even sure that it is reporting a problem. :?

Posted: Fri Jun 15, 2007 3:13 pm
by ray.wurlod
DS.CHECKER checks for orphaned DataStage repository objects. However in your case it did not complete correctly, apparently because an earlier run had been interrupted.

You will need to delete all references to DS_JOBS.cleanup (including the VOC entry if any) before proceeding with DS.CHECKER.

Posted: Mon Jun 18, 2007 9:49 am
by bashbal
ray.wurlod wrote: You will need to delete all references to DS_JOBS.cleanup (including the VOC entry if any) before proceeding with DS.CHECKER.
Thanks! It worked and I was able to run DS.CHECKER. However, it did not find any errors.

I also reindexed the project and cleared the &PH&. The Oracle admin is restarting the database every night to clear the memory. Something must be working because we didn't have any problems over the weekend.
Of course, we don't know what really fixed the problem...
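
As a footnote on clearing &PH&: a cautious way to do it from the shell is to list the old files first and delete only after reviewing them. This is a sketch, assuming &PH& is a subdirectory of the project directory; list_old_ph is a made-up helper name and the 7-day default is arbitrary.

```shell
#!/bin/sh
# list_old_ph PROJECT_DIR [DAYS]: list files in the project's &PH& directory
# that were last modified more than DAYS days ago (default 7).
# Hypothetical helper for illustration; it only lists, it does not delete.
list_old_ph() {
    find "$1/&PH&" -type f -mtime +"${2:-7}" -print
}

# Example (run from the project directory):
#   list_old_ph . 7
```

Listing rather than deleting is deliberate here, so the files can be reviewed and removed at a time when no jobs are running.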