Jobs Aborting

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Jobs Aborting

Post by hcbranco »

Hi...

I have some DS Jobs which abort from time to time, i.e. sometimes they run and sometimes they abort.
The probability of aborting increases when the number of running Jobs is high. Sometimes there are more than 100 Jobs, from different Projects, running at the same time.
I detected three types of messages from the aborted jobs:

1. Record Wave
- Job aborts with the following error (red) message:
RJDIM.JIDQxDIMxBxESCALAOxANUL.JobControl (fatal error from DSGetJobInfo): Job control fatal error (-99) (DSGetJobInfo) JIDQxDIMxBxESCALAOxANUL.: RT_STATUS Record for wave 1 not found.
- This is a Multiple Instance DS Job (RJDIM) which runs a DS Routine, which in turn performs various validations and then runs the corresponding Job (JIDQxDIMxBxESCALAOxANUL)
- The final Job did not run.

2. Unknown State
- Job aborts with the following error (red) message:
RJFAC.JGRxREMxBxMESREFxBxFACxREMUNxEQUIV_01.JobControl (fatal error from JobControl): RJGRxREMxBxMESREFxBxFACxREMUNxEQUIV.01: aborted with unknow state! (21)
- This is also a Multiple Instance DS Job (RJFAC) which runs a DS Routine, which in turn performs various validations and then runs the corresponding Job (RJGRxREMxBxMESREFxBxFACxREMUNxEQUIV, Instance: 01)
- The final Job Instance did not run.

3. Green Abort
- Job aborts with the following (green) message:
Job JGRxREMxBxMESREFxAxHASHxFILES.160 aborted.
- This is a DS Job that aborted with only a green (informational) message.
- It has no error messages; it has some warnings (5), but those are expected to occur.

Have I reached some limit in DS, or is it possible to tune the configuration of the DS Server, e.g. its config parameters?

The Environment is:
- DataStage 5.2
- SunOS 5.8 (8 CPUs) with 16 GB RAM

Best Regards,
Hugo
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
DS limits are set by OS/kernel parameters (plus login and environment settings), uvconfig (the DS configuration file) and the resources available.

You need to monitor the machine in similar situations to determine what causes the aborts: are you reaching any limit on the number of processes, memory and so on? I recommend getting the help of a sysadmin to do this.

P.S.:
Sometimes there is a hard-coded limit in the scripts that start the dsrpcd process during the machine's start-up, which overrides your uvconfig values, so check that as well.
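
For example, something along these lines compares what uvconfig requests with what is actually in effect (a rough sketch: it assumes $DSHOME points at your DSEngine directory and that smat -t lists the live tunables, as on other UniVerse-based engines):

    cd $DSHOME
    egrep 'MFILES|T30FILE' uvconfig           # values requested in the config file
    bin/smat -t | egrep -i 'mfiles|t30file'   # values actually in effect
    # If the two differ, something (e.g. the dsrpcd start-up script) is
    # overriding uvconfig; after editing uvconfig, run bin/uvregen and
    # restart the engine so the change takes effect.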

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

You need to look in several places. I would start by running top or vmstat during these times; see if you can figure out whether you are running out of swap space. In version 5 we had to sleep 20 seconds before we would start a second process. That allowed us to get more processes running without aborting. If you start 100 jobs one right after another in version 5 then it will not cope. I would also look at the uvconfig settings for MFILES and a few other parameters, and make sure these match up with the UNIX kernel parameters. There is an old tech bulletin which discusses this; it should be on the install CD. If not, maybe we can find it for you.
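
Something like this shell loop illustrates the staggered start-up idea (an untested sketch; JOB_LIST and MyProject are placeholders you would replace with your own job list and project):

    # start each job in turn, pausing so the start-ups do not pile up
    for job in `cat JOB_LIST`
    do
        $DSHOME/bin/dsjob -run MyProject $job
        sleep 20
    done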
Mamu Kim
hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Post by hcbranco »

Hi again.

I've spoken with the DataStage Administrator and we've decided to run a shell script, via crontab, to monitor the machine.
The info collected is (see the sketch after this list):
- Number of existing 'dscs' processes
- Number of existing 'phantom' processes
- Number of connections to DataStage (netstat grep for uvrpc)
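
Something like this (a sketch of the script; process and service names such as dscs, phantom and uvrpc may differ by release, and the log path is just an example):

    #!/bin/sh
    # count DataStage-related processes and connections, log with a timestamp
    DSCS=`ps -ef | grep dscs | grep -v grep | wc -l`
    PHANTOM=`ps -ef | grep -i phantom | grep -v grep | wc -l`
    CONNS=`netstat -a | grep uvrpc | wc -l`
    echo "`date` dscs=$DSCS phantom=$PHANTOM uvrpc=$CONNS" >> /tmp/ds_monitor.log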

At this precise moment, DataStage is beginning to get stuck.

There are:
- 44 'dscs' processes
- 60 'phantom' processes
- 47 connections to DataStage

The machine status is:
- 100% CPU usage (84% user, 16% kernel)
- 12G memory free (16G real)
- 12G swap free
- load average: 12

The MFILES parameter is 50.
The T30FILE parameter is 3000.

Any ideas of what may be happening?

Best Regards,
Hugo
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Run sysdef and look at just the kernel parameters on the last few lines. It looks like MFILES may need to increase. The 100% CPU says you are CPU bound and not IO bound, but most DataStage processes are heavy on IO, so the high %user suggests what you are doing is not very efficient. How long have these processes been running? If they are all in start-up mode then you need to stagger the start-up of these processes. Add some sleep time between the start-ups of all these jobs. The result will be more throughput, because all your jobs will be working instead of failing.

You need to figure out how much you can get to work instead of figuring out what fails: how many jobs can you get to run without them failing?
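
On Solaris, for example (a sketch; the exact wording of the sysdef output varies by release):

    sysdef | grep 'file descriptors'
    # prints the soft:hard limits in hex,
    # e.g. 0x...0100:0x...0400 means 256 soft / 1024 hard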
Mamu Kim
hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Post by hcbranco »

Hi again,

I've raised MFILES to 500.
sysdef says "0x0000000000000100:0x0000000000000400 file descriptors", i.e. a soft limit of 256 and a hard limit of 1024.
ulimit -a says "open files: 1024".
In some situations, I've also reduced the number of Jobs running in parallel.
But...
When the number of "phantom" processes is near 100 (ps -ef|grep -i phantom), sometimes a job aborts with
"RJFAC.JGRxREMxBxINCRxBxDUMP_03.JobControl (fatal error from DSGetJobInfo): Job control fatal error (-99)
(DSGetJobInfo) JGRxREMxBxINCRxBxDUMP.03: RT_STATUS Record for wave 1 not found."


Are about 100 "phantom" processes too many for DS?

Best Regards,
Hugo
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

As pointed out in other posts, it might be better to filter on DSD.RUN rather than phantom, since there can be many phantoms for each job or job sequence. By doing this you can see how many distinct jobs are running. Another question might be: are these processes running many of the same jobs as multi-instance jobs in parallel, or are they all different jobs running in parallel?
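
For example (a sketch; the grep -v grep keeps the grep process itself out of the count):

    ps -ef | grep 'DSD.RUN' | grep -v grep | wc -l   # roughly one DSD.RUN per running job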
hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Post by hcbranco »

Hi again,

At this moment there are:
- 44 DSD.RUN
- 14 StageRun

The bigger part (70-80%) of the DSD.RUN processes are multi-instance jobs running in parallel.

Right now, DS is stable, but at some point in the chain the number of DSD.RUN/StageRun processes will rise to an estimated 80/30.

Best regards,
Hugo