OS troubleshooting while running 10 jobs in parallel


teety
Participant
Posts: 7
Joined: Wed Sep 03, 2003 12:52 am
Location: Sydney

OS troubleshooting while running 10 jobs in parallel

Post by teety »

I have a Sun Fire server with 6 CPUs and 8 GB of RAM. I tried to run 10 jobs in parallel. After running for a while, some jobs aborted with these error messages:
1. "Abnormal termination of stage FCP02RF0111800..hrefFCPORGTPID.IDENT5 detected" This job reads data from Oracle, then creates a hashed file.
2. "LAR99CT0111700..JobControl (fatal error from DSRunJob): Job control fatal error (-14)
(DSRunJob) Job LAR05TS0611700 appears not to have started after 60 secs" This is the batch control job, which calls the child job, but the child job could not be started (for an unknown reason, I think). The batch job eventually aborted.
Normally I run all the batch jobs in sequence, and I have never faced this kind of problem before.
Could anyone help me out? I need your suggestions ASAP.

Thank you in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Read the message again.

You're trying to get the machine to do more than it can successfully accomplish. 8GB memory is rather low for a SunFire system with 6CPUs.

DataStage has a hard-wired time limit on the DSRunJob function. If it is unable to start the job within 60 seconds, the message you received is generated. I expect that some, if not all, of these ten jobs make strong demands for resources. Inability to allocate resources is by far the most common cause of this error.

Try running no more than six jobs simultaneously.
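
One way to apply that cap without giving up parallelism entirely is to launch the jobs through a bounded worker pool. The following is only a minimal sketch in Python, not DataStage code: `run_with_limit` and the `run_job` callable are hypothetical names, and `run_job` stands in for whatever actually starts a job and waits for it to finish on your system.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(job_names, run_job, max_parallel=6):
    """Run run_job(name) for every job, never more than max_parallel at once.

    run_job is a hypothetical callable supplied by the caller; it should
    start one job, block until it finishes, and return its status.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps results in the same order as job_names.
        statuses = list(pool.map(run_job, job_names))
    return dict(zip(job_names, statuses))
```

With ten job names and `max_parallel=6`, the first six jobs start immediately and each remaining job starts only when a running one finishes.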
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I bet the T30FILES value is too low in the uvconfig file. You will probably need to increase it if you're using a lot of dynamic hash files. Remember, each job uses a couple of dynamic hash files by its nature (logs, status, config, etc.), in addition to any your design uses.
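
A rough back-of-the-envelope check of that reasoning can be sketched in Python. The function names, the per-job counts, and the 80% safety margin are all assumptions for illustration; the real number of dynamic hash files a job opens depends on its design.

```python
def estimate_open_dynamic_files(concurrent_jobs, overhead_per_job, design_files_per_job):
    """Estimate how many dynamic hash files are open at the same time.

    overhead_per_job: files each job opens by its nature (assumed value).
    design_files_per_job: files the job design itself uses (assumed value).
    """
    return concurrent_jobs * (overhead_per_job + design_files_per_job)


def has_headroom(estimate, t30files, safety_factor=0.8):
    """Return True if the estimate stays within safety_factor of T30FILES."""
    return estimate <= t30files * safety_factor


# Example: 10 concurrent jobs, 3 overhead files and 2 design files each (assumed).
needed = estimate_open_dynamic_files(10, 3, 2)
```

Comparing `needed` with the configured T30FILES value through `has_headroom` gives a quick indication of whether the setting is a plausible cause before looking at other resources.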
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
teety
Participant
Posts: 7
Joined: Wed Sep 03, 2003 12:52 am
Location: Sydney

Post by teety »

I set MFILES = 91 and T30FILES = 1000, and the ULIMIT parameter is unlimited. Is that enough? Or could you recommend appropriate values?

Thanks
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Let us know if your problem is still around.
asnagaraj
Participant
Posts: 26
Joined: Wed Jun 25, 2003 12:41 am

Same Problem

Post by asnagaraj »

I have experienced a similar problem.

Batch::MasterJobControl..JobControl (fatal error from DSRunJob): Job control fatal error (-14)
(DSRunJob) Job Batch::ETLJobControl appears not to have started after 60 secs

But the ETLJobControl job started within the next 2 seconds, so I don't understand why I got this message. Any replies would be helpful.

Thanks
Naga.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If the job started 62 seconds after the job control function requested it, then it fell outside the tolerance hard-coded into the API. This problem usually occurs on a system that is overwhelmed with tasks. You need to look at resource availability and see whether the system is short of resources at the time the problem happens.
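
To illustrate how a fixed tolerance behaves, here is a minimal Python sketch of a start check with a hard deadline. It is not DataStage's actual implementation: `wait_for_start`, `has_started`, and the one-second polling interval are hypothetical, and only the 60-second figure comes from the error message.

```python
import time


def wait_for_start(has_started, timeout_secs=60.0, poll_secs=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll has_started() until it returns True or the deadline passes.

    has_started is a hypothetical callable that reports whether the child
    job is running. clock and sleep are injectable so the logic can be
    exercised without real waiting.
    """
    deadline = clock() + timeout_secs
    while clock() < deadline:
        if has_started():
            return True
        sleep(poll_secs)
    # The deadline passed: the caller reports "not started", even if the
    # job begins running a moment later.
    return False
```

Under this model a job that comes up after 2 seconds is reported as started, while one that comes up after 62 seconds is reported as not started even though it runs to completion afterwards.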
asnagaraj
Participant
Posts: 26
Joined: Wed Jun 25, 2003 12:41 am

Post by asnagaraj »

It was not 62 seconds. It was just 2 seconds after the job control routine (JCR) started. But the JCR aborted, reporting that the job didn't start. The invoked job did start and completed (Status = Ok) within 2 minutes.