Abnormal termination of stage

bashbal
Premium Member
Posts: 23
Joined: Mon Mar 01, 2004 12:26 pm
Location: Milwaukee, WI

Post by bashbal »

I can't figure out why some of my jobs are aborting. These jobs have been stable for months, but in the last week, we've suffered several unexplained errors. Here's an example:

Abnormal termination of stage LoadCustomerGroupDimension..XfrmCustomerGroup detected

After resetting the job in DS Director (just for you, Ray :) ) I get the following "From previous run" messages:

From previous run
DataStage Job 256 Phantom 12616
jobnotify: Unknown error
DataStage Phantom Finished.
[12649] DSD.StageRun LoadCustomerGroupDimension. LoadCustomerGroupDimension.XfrmCustomerGroup 387 0/50 - terminated.
From previous run
DataStage Job 256 Phantom 12649
Abnormal termination of DataStage.
Fault type is 10. Layer type is BASIC run machine.
Fault occurred in BASIC program DSD.GetRTProp at address 258.


We have an approved project to upgrade to 7.5, but that is at least a couple of months away. Meanwhile, these failures are delaying a "management" critical report.
Lyle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

As always, find out what's changed. It may not be something in DataStage - you may need to cast your net more widely. Were you running using a different login ID?

Or it could be a damaged or corrupted run-time hashed file in the Repository; the error message was generated from the (internal) routine DSD.GetRTProp (get run-time property). Execute the following commands in the project directory.

. $DSHOME/dsenv
$DSHOME/bin/fixtool RT_CONFIG256
$DSHOME/bin/fixtool RT_STATUS256
Also check that these are writeable by your DataStage user.
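
A minimal sketch of that writability check, assuming a POSIX shell and that it is run as the DataStage user (root passes every `-w` test, so the result means nothing as root). `list_unwritable` is a hypothetical helper name, not a DataStage command. It only tests the top-level RT_* entries; hashed files that are stored as directories may need their contents checked as well.

```shell
# Hypothetical helper: list RT_* entries under a project directory that
# the current user cannot write to. Run it as the DataStage user.
list_unwritable() {
    project_dir=${1:-.}
    for f in "$project_dir"/RT_*; do
        [ -e "$f" ] || continue            # the glob matched nothing
        [ -w "$f" ] || printf '%s\n' "$f"  # print only unwritable entries
    done
}
```

Calling `list_unwritable /path/to/project` (the path is a placeholder) prints nothing when every entry is writable.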
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Abnormal termination of stage

Post by chulett »

bashbal wrote: Fault type is 10. Layer type is BASIC run machine.
Fault occurred in BASIC program DSD.GetRTProp at address 258.
As noted, what changed? A fault type of 10 is a SIGBUS or 'bus error'. We get them on HP-UX when running Korn shell scripts if the $LD_PRELOAD environment variable is set in the dsenv file. And we got them in Oracle processing when we were using a 'bugged' version of the Oracle client with a nasty memory leak.

Any chance of either of those being the culprit?
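
A quick way to rule out the first one might look like the sketch below. It assumes dsenv uses plain `VAR=value` or `export VAR=value` lines; `dsenv_sets_var` is a hypothetical helper name, and it only matches text rather than evaluating the file. Separately, `kill -l 10` prints the name your platform gives signal 10; the numbering is platform dependent, so the answer will not be the same everywhere.

```shell
# Hypothetical helper: succeed if FILE assigns VAR on an uncommented line,
# with or without a leading "export". This is a text match only; the file
# is never sourced or evaluated.
dsenv_sets_var() {
    file=$1
    var=$2
    grep -Eq "^[[:space:]]*(export[[:space:]]+)?${var}=" "$file"
}
```

For example, `dsenv_sets_var "$DSHOME/dsenv" LD_PRELOAD && echo "LD_PRELOAD is set in dsenv"`.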
-craig

"You can never have too many knives" -- Logan Nine Fingers
bashbal
Premium Member
Posts: 23
Joined: Mon Mar 01, 2004 12:26 pm
Location: Milwaukee, WI

More info

Post by bashbal »

Ray,

1) The user id running has not changed. It is always dsadm.
2) Thanks for the fixtool example. However, no errors were found. I'm going to run it against all of the job files just to be safe.
3) All RT* files are writable by dsadm.
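
Running the check against every job's files could be scripted roughly as follows. This is only a sketch: `for_each_rt_file` is a hypothetical helper name, and it assumes the tool is invoked from the project directory with just the file name, as in Ray's commands. With no second argument it is a dry run that prints the file names it would process.

```shell
# Hypothetical helper: apply a command to every RT_CONFIG* and RT_STATUS*
# file in a project directory. Without a second argument it only echoes
# the file names (dry run).
for_each_rt_file() {
    project_dir=$1
    tool=${2:-echo}
    (
        # Work inside the project directory, in a subshell so the
        # caller's working directory is left alone.
        cd "$project_dir" || exit 1
        for f in RT_CONFIG* RT_STATUS*; do
            [ -e "$f" ] || continue   # the glob matched nothing
            "$tool" "$f"
        done
    )
}
```

After sourcing dsenv, the real run would be `for_each_rt_file /path/to/project "$DSHOME/bin/fixtool"` (the project path is a placeholder).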

Craig,
1) My sysadmin agrees with you about the $LD_PRELOAD. He discovered problems with this and removed it from dsenv a long time ago (> 1yr).
2) We will look into the possible leaky Oracle client. We are running Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production

What has changed:
1) The only system level change was the rollout of Centrify active directory security to our server. However, this was a month ago and the dsadm, oracle, root & cognos user IDs are excluded from this process.
2) We did turn over some new jobs last week, but they run in a different job stream at a different time. We will still check them for collateral damage, however.
3) The only other change is what changes every day...new data. It may be that we have been flirting with some limit for some time and recently loaded the "straw that broke the camel's back". Our sysadmin is analyzing system stats to see if we can find the "culprit".

One more thing: the jobs don't fail every day. The last week went like this: Thurs - crash, Fri - crash, Sat - ran okay, Sun - crash, Mon - crash, Tue - crash, Wed - ran okay, Thurs - crash, Fri (today) - ran okay.
No changes have been made.

Here's our recovery procedure: reset the jobs and rerun with no changes. This issue is very frustrating because we have not been able to consistently reproduce the errors.
Lyle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

IBM Support identified the 9.2.0.5 client as one of the 'problematic' ones. However, it is not nearly as problematic as 9.2.0.1, which I was given 'accidentally' one day. Good Lord was that a mess until that little detail was discovered. :evil:

IBM suggested upgrading the client to 9.2.0.6, which we did and which resolved all occurrences of the issue for us. However, that's only applicable if the only jobs that crash with this SIGBUS error are using the OCI stages, and I don't believe you've clarified that one way or the other.
-craig

"You can never have too many knives" -- Logan Nine Fingers
bashbal
Premium Member
Posts: 23
Joined: Mon Mar 01, 2004 12:26 pm
Location: Milwaukee, WI

Oracle patch

Post by bashbal »

We just received docs from Oracle on the memory leak in 9.2.0.5. A workaround is listed, as well as the 9.2.0.7 patch. He's going to try those and see if it helps.

To clarify, the job I listed uses two OCI 9i stages.

------------
I just ran DS.TOOLS->4. Check integrity of job files and received the following messages:
COMO DSR_CHECKER established 09:40:36 15 JUN 2007
"DS_JOBS.cleanup" already has a DATA definition record.
File name =
File not created.
Program "DS.CHECKER": Line 664, Improper data type.
:o

I must admit that I'm not that familiar with this command and only ran it because, based on its title, it seemed like a good idea to try. I'm not even sure that it is reporting a problem. :?
Lyle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

DS.CHECKER checks for orphaned DataStage repository objects. However in your case it did not complete correctly, apparently because an earlier run had been interrupted.

You will need to delete all references to DS_JOBS.cleanup (including the VOC entry if any) before proceeding with DS.CHECKER.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bashbal
Premium Member
Posts: 23
Joined: Mon Mar 01, 2004 12:26 pm
Location: Milwaukee, WI

Post by bashbal »

ray.wurlod wrote: You will need to delete all references to DS_JOBS.cleanup (including the VOC entry if any) before proceeding with DS.CHECKER.
Thanks! It worked and I was able to run DS.CHECKER. However, it did not find any errors.

I also reindexed the project and cleared the &PH&. The Oracle admin is restarting the database every night to clear the memory. Something must be working because we didn't have any problems over the weekend.
Of course, we don't know what really fixed the problem...
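
For anyone wanting to script the &PH& housekeeping, here is a rough sketch. It assumes &PH& is an ordinary directory under the project directory and that no jobs are running at the time; `purge_old_files` and the seven-day default are illustrative choices, not necessarily how it was done here.

```shell
# Hypothetical helper: delete regular files older than DAYS days from DIR.
# Intended for a project's &PH& directory while no jobs are running.
purge_old_files() {
    dir=$1
    days=${2:-7}
    # -mtime +N matches files last modified more than N days ago.
    find "$dir" -type f -mtime +"$days" -exec rm -f {} +
}
```

Quote the directory name when calling it, since `&` is special to the shell: `purge_old_files '/path/to/project/&PH&' 7` (the path is a placeholder).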
Lyle