
Error calling DSRunJob -99

Posted: Wed Sep 19, 2007 8:46 am
by edwds
Good morning. I need your help, guys! For the past 14 evenings we have been having a particular sequence fail. This sequence fires off about 100 simultaneous server and parallel jobs (mostly server). All the jobs are very simple. No hash files are used in these jobs. All they do is read from one Oracle table and load to another. Every evening we get the above error associated with a different job, but always part of the same sequence. Once it fails we submit the sequence again 5 minutes later and it works. How can we debug this problem?
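
One place to start, assuming the dsjob command-line client under $DSHOME/bin is available on the server, is to pull the log summary for the job named in the error and for the calling sequence; the project name below is a placeholder.

Code:

# Source the engine environment first (usual default install layout assumed).
cd $DSHOME && . ./dsenv
# Summarise recent log events for the job the sequence reports as failing
# and for the sequence itself (MyProject is a placeholder project name).
bin/dsjob -logsum MyProject srcTRANTYP
bin/dsjob -logsum MyProject seqLMS_SRC
# Show the full text of one event, using an event id taken from the summary output.
bin/dsjob -logdetail MyProject srcTRANTYP 25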

Re: Error calling DSRunJob -99

Posted: Wed Sep 19, 2007 8:59 am
by sachin1
What is the error message? Please post it.

Re: Error calling DSRunJob -99

Posted: Wed Sep 19, 2007 10:47 am
by kwwilliams
Sounds like your server is overloaded. DataStage cannot run all of the jobs because it does not have the resources, so the job fails. That is why you can rerun the sequence five minutes later without a problem.

Error message

Posted: Wed Sep 19, 2007 11:00 am
by edwds
23:44:04: Exception raised: @srcTRANTYP, Error calling DSRunJob(srcTRANTYP), code=-99 [General repository interface 'other error']

We have looked at memory and CPU once the job has failed, and both are around 50% when this error occurs.

Re: Error message

Posted: Wed Sep 19, 2007 11:50 am
by kwwilliams
Those aren't the only resources. By starting up 100 jobs, how many PIDs are you creating? Have you hit the process limit for your user? Can DataStage spawn the processes fast enough? After a certain period of time it will abort because it could not spawn the next process.

Try not starting one hundred jobs at one time. Pretty sure that will solve your problem.
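
A rough way to see whether the per-user process limit is the constraint, sketched here assuming a Unix server and that the engine runs as user dsadm (substitute your own user name):

Code:

# Maximum number of processes allowed for the current user (ksh/bash built-in).
ulimit -u
# Count processes owned by the DataStage user while the sequence is running.
ps -ef | awk '$1 == "dsadm"' | wc -l
# Server jobs run as phantom processes; count those as well.
ps -ef | grep phantom | grep -v grep | wc -l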

Posted: Wed Sep 19, 2007 12:14 pm
by edwds
But then why would the exact same sequence run fine 5 minutes later?

Posted: Wed Sep 19, 2007 3:17 pm
by kwwilliams
Resources became free at that point, whether PIDs or anything else. Having 100 jobs kick off at the same time is going to run into issues. Is there a reason to kick them all off at the same time, or was it just easier?

Posted: Tue Sep 25, 2007 9:59 am
by edwds
To save time we kick them off simultaneously. Also, none of them is dependent on any of the others, so we figured it was safe to do so. It's been running fine for a couple of years, but we do add about 10 to 20 jobs to this sequence a year. After more research and more failures I ran into this error:

Code:

Program "DSD.Init": Line 41, Unable to allocate Type 30 descriptor, table is full.
DataStage Job 318 Phantom 28130
Job Aborted after Fatal Error logged.
Program "DSD.WriteLog": Line 250, Abort.
Attempting to Cleanup after ABORT raised in stage seqLMS_SRC..JobControl

We have our T30FILE property set to the default of 200. Do you think changing this will fix the problem? Reason I ask is: why would this same sequence run fine later on and not give the error above? I would think making the change to this uvconfig parameter would only help if it failed consistently every time.

Posted: Tue Sep 25, 2007 5:04 pm
by ray.wurlod
Changing T30FILE is the solution to this problem. Increase it to 1000.
You will then need to stop, regenerate and restart the DataStage server.

T30FILE is the total number of dynamic hashed files that can be open simultaneously. Although you assert that your job designs do not use hashed files, the Repository tables are all hashed files; for every job that runs there are three or four hashed files open to record run-time process metadata.
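
The change itself goes in the uvconfig file in the engine directory and only takes effect after a regen; roughly as follows, assuming a Unix install with the usual $DSHOME layout (paths and admin commands can differ slightly by version):

Code:

cd $DSHOME
. ./dsenv                  # pick up the engine environment
bin/uv -admin -stop        # stop the DataStage server engine
vi uvconfig                # change the line: T30FILE 200  ->  T30FILE 1000
bin/uv -admin -regen       # regenerate the shared configuration (bin/uvregen on some versions)
bin/uv -admin -start       # restart the engine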

Posted: Tue Sep 25, 2007 6:47 pm
by chulett
edwds wrote: "Reason I ask is: why would this same sequence run fine later on and not give the error above?"
It's a resource constraint and is all about what is running in total at the time the error occurs. That's why it 'runs fine later on' and why changing the parameter, as Ray notes, is needed.

Posted: Thu Oct 11, 2007 8:28 am
by edwds
We changed T30FILE to 1000 and then the error changed to error -14. We then moved the jobs in the sequence around so that fewer run simultaneously. This solved the problem. Instead of running 100 jobs at the same time, we are now down to about 75.

Posted: Thu Oct 11, 2007 2:29 pm
by ray.wurlod
Did you regenerate and restart DataStage after changing T30FILE? Execute the command analyze.shm -t | grep T30FILE to find out whether T30FILE has indeed been increased.
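
For reference, the check would look something like this when run from the engine directory, assuming $DSHOME points at the DSEngine install:

Code:

cd $DSHOME && . ./dsenv
bin/analyze.shm -t | grep T30FILE    # should report the new value if the regen took effect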