job control process (pid xxxx) has failed

Posted: Fri May 27, 2016 4:33 am
by wuruima
I got the warning message "job control process (pid xxxx) has failed", and then the job aborted. After searching the IBM support site, I found this:

Problem (Abstract)

Sequence job control process (pid xxxx) has failed

Cause


A Sequence job running continuously in a loop sources the dsenv on each run, and each sourcing appends to the library path, causing the length of your LD_LIBRARY_PATH (Sun/Linux), LIBPATH (AIX), LIB_PATH (HPUX) environment variable to exceed 8192 bytes.

Diagnosing the problem

If, after actioning the steps in Technote http://www-01.ibm.com/support/docview.w ... wg21397247, the issue persists and you are running a Sequence job continuously in a loop, then the next action is to check the length of your LD_LIBRARY_PATH (Sun/Linux), LIBPATH (AIX), LIB_PATH (HPUX) environment variable and ensure the length of this string has NOT exceeded 8192 bytes.

If it has, then the likely cause is that the dsenv is being sourced continuously in a loop as well.
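The length check described above could be sketched roughly as follows. This is only an illustration, not part of the technote: the function names are made up here, the variable names are the ones listed above, and the script has to run in the same environment as the job (for example, after sourcing the dsenv the same way the job does) for the numbers to mean anything.

```python
import os

LIMIT_BYTES = 8192  # threshold quoted in the technote text above
# Variable names exactly as listed in the technote, one per platform.
LIBRARY_PATH_VARS = ("LD_LIBRARY_PATH", "LIBPATH", "LIB_PATH")


def library_path_lengths(env=None):
    """Return {variable: length in bytes} for each listed variable that is set."""
    env = os.environ if env is None else env
    return {
        name: len(os.fsencode(env[name]))
        for name in LIBRARY_PATH_VARS
        if name in env
    }


def over_limit(env=None, limit=LIMIT_BYTES):
    """Return the names of the variables whose value is longer than `limit` bytes."""
    return [name for name, size in library_path_lengths(env).items() if size > limit]


if __name__ == "__main__":
    # Report on the current process environment.
    for name, size in library_path_lengths().items():
        status = "OVER LIMIT" if size > LIMIT_BYTES else "ok"
        print(f"{name}: {size} bytes ({status})")
```

Running it once before the loop starts and again after a few iterations would show whether the value is actually growing.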

Resolving the problem

Set the environment settings outside the loop, or set absolute strings (such as "LD_LIBRARY_PATH=<all-paths>") without appending :$LD_LIBRARY_PATH. Appending can cause the path settings to be repeated over multiple runs and finally cause the crash.
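To see why appending matters, here is a small simulation of the two styles of setting the variable. It is a sketch under assumptions: the directory names are hypothetical, the helper names are invented for this example, and it only models string growth, not how dsenv is really sourced.

```python
def append_each_run(current, new_paths):
    """Mimic LD_LIBRARY_PATH=<new-paths>:$LD_LIBRARY_PATH, which grows on every run."""
    return new_paths + (":" + current if current else "")


def set_absolute(new_paths):
    """Mimic LD_LIBRARY_PATH=<all-paths>, which gives the same value on every run."""
    return new_paths


def dedupe(path_value):
    """Drop repeated entries from a colon-separated path, keeping first-seen order."""
    seen = []
    for entry in path_value.split(":"):
        if entry and entry not in seen:
            seen.append(entry)
    return ":".join(seen)


# Hypothetical library directories, for illustration only.
paths = "/opt/example/lib:/opt/example/branded_odbc/lib"

appended = ""
absolute = ""
for _ in range(200):  # 200 loop iterations that each re-apply the settings
    appended = append_each_run(appended, paths)
    absolute = set_absolute(paths)

print("appended style:", len(appended), "characters")
print("absolute style:", len(absolute), "characters")
print("appended style after dedupe:", len(dedupe(appended)), "characters")
```

With the appending style the value grows by one copy of the paths per iteration, while the absolute style stays the same length no matter how many times it is applied.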

Re: how to understand this error

Posted: Fri May 27, 2016 4:39 am
by wuruima
I designed a sequence job which contains only a routine. In the routine, I first trigger jobs A, B, C, D to run one by one (using a for loop).
Then I have a for loop from 1 to 9 that submits jobs index1...index9 to run in parallel.

This is the log where it aborted.

[info]a..JobControl (DSRunJob): Waiting for job index1 to start
[warn]Job control process (pid 28967492) has failed

Re: how to understand this error

Posted: Fri May 27, 2016 4:53 am
by wuruima
I simply reran the job without any changes. Now it's processing jobs 1-9 with no error.

Posted: Fri May 27, 2016 6:35 am
by chulett
So... the sequence job itself had the PID failure or one of the jobs it attempted to run had the failure? For the latter, anything in that job's log? :?

For an intermittent error like this, something you can't reproduce, in your shoes I would involve support.

Posted: Fri May 27, 2016 7:23 am
by chulett
However, I will say that in my experience when you see something like this:

A fails. Sometime later with no changes or intervention, A runs fine.

This is usually resource related. As in a lack thereof.

Posted: Sun May 29, 2016 7:03 pm
by wuruima
Yes, recently the DS environment has sometimes run out of resources.

In those cases the error message contains words like "resource"; the error message above, however, is not easy to understand.

Posted: Sun May 29, 2016 7:35 pm
by wuruima
[info]a..JobControl (DSRunJob): Waiting for job index1 to start
[warn]Job control process (pid 28967492) has failed

After this log entry there is nothing special; it only shows that the sequence job aborted.

Posted: Wed Jun 01, 2016 12:32 pm
by Teej
We actually dislike this kind of support ticket: "It failed, and then worked again, do our work for us!"

Get a consultant to help diagnose the system issue if you do not have an appropriate resource who is skilled enough to evaluate your server. Do not lean on IBM Support without specific details, such as "Why is running x, y, z producing action a, b, c on this server?"

Tickets that complain that it failed then worked, with no further investigation done, will most likely require specific consulting assistance. It is your server, which is unlike most of our other customers' servers, with different settings, configurations, and software installed. We need you to investigate how you set it up, and find out what is going on at the system level, before we can help explain the why.

Posted: Thu Jun 02, 2016 8:46 am
by PaulVL
My money (2 cents) is on the ulimit of the user id running the job. Check the nofile value. I am guessing it is the default 1024, which is way too low for an ETL environment.
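One rough way to look at the nofile value is sketched below; it reports the same limit as `ulimit -n` in the shell. This is an illustration, not an official procedure: the helper names are invented, the 1024 threshold is just the guess above, and it has to be run as the user id that runs the jobs, on a Unix system, to be meaningful.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def is_low(soft, threshold=1024):
    """True when the soft limit is at or below `threshold` (1024 is the guess above)."""
    return soft != resource.RLIM_INFINITY and soft <= threshold


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft limit: {describe(soft)}, hard limit: {describe(hard)}")
    if is_low(soft):
        print("Soft limit is at or below 1024; this may be too low for many concurrent jobs.")
```

The soft limit is the one a process actually runs into; the hard limit is the ceiling the soft limit can be raised to without extra privileges.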

Posted: Tue Jun 14, 2016 1:45 am
by wuruima
Thanks for your long response.
The job failed with a message I could not understand. Even though I found the explanation on the IBM website, it was not clear to me, which is why I posted here. I suspect this is a server resource issue, but who knows. After the rerun the job resumed; I just want to know what the error means.