Jobs Aborting

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Jobs Aborting

Post by hcbranco »

Hi...

I have some DS Jobs which abort from time to time, i.e. sometimes they run and sometimes they abort.
The probability of aborting increases when the number of running Jobs is high. Sometimes there are more than 100 Jobs, from different Projects, running at the same time.
I detected three types of messages from the aborted jobs:

1. Record Wave
- Job aborts with the following error (red) message:
RJDIM.JIDQxDIMxBxESCALAOxANUL.JobControl (fatal error from DSGetJobInfo): Job control fatal error (-99) (DSGetJobInfo) JIDQxDIMxBxESCALAOxANUL.: RT_STATUS Record for wave 1 not found.
- This is a Multiple Instance DS Job (RJDIM) which runs a DS Routine, which in turn performs various validations and then runs the corresponding Job (JIDQxDIMxBxESCALAOxANUL)
- The final Job did not run.

2. Unknown State
- Job aborts with the following error (red) message:
RJFAC.JGRxREMxBxMESREFxBxFACxREMUNxEQUIV_01.JobControl (fatal error from JobControl): RJGRxREMxBxMESREFxBxFACxREMUNxEQUIV.01: aborted with unknow state! (21)
- This is also a Multiple Instance DS Job (RJFAC) which runs a DS Routine, which in turn performs various validations and then runs the corresponding Job (RJGRxREMxBxMESREFxBxFACxREMUNxEQUIV, Instance: 01)
- The final Job Instance did not run.

3. Green Abort
- Job aborts with the following (green) message:
Job JGRxREMxBxMESREFxAxHASHxFILES.160 aborted.
- This is a DS Job that aborted with only a green (informational) message.
- It has no error messages; it has some warnings (5), but those are expected to occur.

Have I reached some limit in DS, or is it possible to tune the configuration of the DS Server, e.g. its config parameters?

The Environment is:
- DataStage 5.2
- SunOS 5.8 (8 CPUs) with 16 GB RAM

Best Regards,
Hugo
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
DS limits are set by OS/kernel parameters (plus login and environment settings), uvconfig (the DS configuration file) and the resources available.

You need to monitor the machine in similar situations to determine what causes the aborts: are you reaching any limit on the number of processes, memory and so on? I recommend getting the help of a sysadmin to do this.

P.S.:
Sometimes there is a hard-coded limit in the scripts that start the dsrpcd process during the machine's start-up, which overrides your uvconfig values, so check that as well.
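
For example, something along these lines compares what uvconfig requests with what is actually in effect (a rough sketch: it assumes $DSHOME points at your DSEngine directory and that smat -t lists the live tunables, as on other UniVerse-based engines):

    cd $DSHOME
    egrep 'MFILES|T30FILE' uvconfig           # values requested in the config file
    bin/smat -t | egrep -i 'mfiles|t30file'   # values actually in effect
    # If the two differ, something (e.g. the dsrpcd start-up script) is
    # overriding uvconfig; after editing uvconfig, run bin/uvregen and
    # restart the engine so the change takes effect.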

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

You need to look in several places. I would start by running top or vmstat during these times; see if you can figure out whether you are running out of swap space. In version 5 we had to sleep 20 seconds before we would start a second process. That allowed us to get more processes running without aborting. If you start 100 jobs one right after another in version 5 then it will not cope. I would also look at the uvconfig settings for MFILES and a few other parameters, and make sure these match up with the UNIX kernel parameters. There is an old tech bulletin which discusses this; it should be on the install CD. If not, maybe we can find it for you.
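
Something like this shell loop illustrates the staggered start-up idea (an untested sketch; JOB_LIST and MyProject are placeholders you would replace with your own job list and project):

    # start each job in turn, pausing so the start-ups do not pile up
    for job in `cat JOB_LIST`
    do
        $DSHOME/bin/dsjob -run MyProject $job
        sleep 20
    done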
Mamu Kim
hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Post by hcbranco »

Hi again.

I've spoken with the DataStage Administrator and we've decided to run a shell script, via crontab, to monitor the machine.
The info collected is (see the sketch after this list):
- Number of existing 'dscs' processes
- Number of existing 'phantom' processes
- Number of connections to DataStage (netstat grep for uvrpc)
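
Something like this (a sketch of the script; process and service names such as dscs, phantom and uvrpc may differ by release, and the log path is just an example):

    #!/bin/sh
    # count DataStage-related processes and connections, log with a timestamp
    DSCS=`ps -ef | grep dscs | grep -v grep | wc -l`
    PHANTOM=`ps -ef | grep -i phantom | grep -v grep | wc -l`
    CONNS=`netstat -a | grep uvrpc | wc -l`
    echo "`date` dscs=$DSCS phantom=$PHANTOM uvrpc=$CONNS" >> /tmp/ds_monitor.log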

At this precise moment, DataStage is beginning to get stuck.

There are:
- 44 'dscs' processes
- 60 'phantom' processes
- 47 connections to DataStage

The machine status is:
- 100% CPU usage (84% user, 16% kernel)
- 12G memory free (16G real)
- 12G swap free
- load average: 12

The MFILES parameter is 50.
The T30FILE parameter is 3000.

Any ideas of what may be happening?

Best Regards,
Hugo
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Run sysdef and look at just the kernel parameters on the last few lines. It looks like MFILES may need to increase. The 100% CPU says you are CPU bound and not IO bound, but most DataStage processes are heavy on IO, so the high %user suggests what you are doing is not very efficient. How long have these processes been running? If they are all in start-up mode then you need to stagger the start-up of these processes. Add some sleep time between the start-ups of all these jobs. The result will be more throughput, because all your jobs will be working instead of failing.

You need to figure out how much you can get to work instead of figuring out what fails: how many jobs can you get to run without them failing?
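
On Solaris, for example (a sketch; the exact wording of the sysdef output varies by release):

    sysdef | grep 'file descriptors'
    # prints the soft:hard limits in hex,
    # e.g. 0x...0100:0x...0400 means 256 soft / 1024 hard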
Mamu Kim
hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Post by hcbranco »

Hi again,

I've raised MFILES to 500.
sysdef says "0x0000000000000100:0x0000000000000400 file descriptors", i.e. a soft limit of 256 and a hard limit of 1024.
ulimit -a says "open files: 1024".
In some situations, I've also reduced the number of Jobs running in parallel.
But...
When the number of "phantom" processes is near 100 (ps -ef|grep -i phantom), sometimes a job aborts with
"RJFAC.JGRxREMxBxINCRxBxDUMP_03.JobControl (fatal error from DSGetJobInfo): Job control fatal error (-99)
(DSGetJobInfo) JGRxREMxBxINCRxBxDUMP.03: RT_STATUS Record for wave 1 not found."


Are about 100 "phantom" processes too many for DS?

Best Regards,
Hugo
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

As pointed out in other posts, it might be better to filter on DSD.RUN rather than phantom, since there can be many phantoms for each job or job sequence. By doing this you can see how many distinct jobs are running. Another question might be: are these processes running many of the same jobs as multi-instance jobs in parallel, or are they all different jobs running in parallel?
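
For example (a sketch; the grep -v grep keeps the grep process itself out of the count):

    ps -ef | grep 'DSD.RUN' | grep -v grep | wc -l   # roughly one DSD.RUN per running job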
hcbranco
Participant
Posts: 4
Joined: Wed Sep 01, 2004 3:57 am

Post by hcbranco »

Hi again,

At this moment there are:
- 44 DSD.RUN
- 14 StageRun

The bigger part (70-80%) of the DSD.RUN processes are multi-instance jobs running in parallel.

Right now, DS is stable, but at some point in the chain the number of DSD.RUN/StageRun processes will rise to an estimated 80/30.

Best regards,
Hugo