Job control process (pid 1084) has failed

rmcclure · Post by **rmcclure** » Mon Nov 03, 2008 10:04 am

Hi,

I am having a very frustrating problem:

We have a sequence job that runs various server jobs and other sequence jobs. This job is set up to give an email notification if it fails.
Sometimes the job neither fails nor completes.
For example:
mainSequencejob runs various server jobs, then runs sequencejob1, which runs serverjob1 and serverjob2; then mainSequencejob moves on to other server and sequence jobs.
Both serverjob1 and serverjob2 complete successfully and sequencejob1 completes successfully, but mainSequencejob logs a warning: "Job control process (pid 1084) has failed"
The frustrating part is that this happens sometime during the night, but there is no email notification. As soon as someone logs into DataStage Director and goes to view the logs, the warning appears and the email is sent. Often the job will then continue, so I get a sequence job with a status of "aborted" while the job is still running. It is almost as if the whole ETL job is in limbo until someone logs in.
We also can't reproduce it. It will happen one day and not the next.

I'm taking a wild guess that our AS/400 is dropping the process and DataStage is not being informed. Since company policy does not allow me to look at the production server, and the sys-admins say "no, nothing happened last night", I can only guess.
I don't understand why DataStage seems to sit and wait if the process ID has been dropped.
Do you think this is a DataStage issue or an AS/400 issue?

Stats:
The Source DB is AS/400 DB2
Target DW is SQL Server 2005
We connect using ODBC
DataStage version is 7.5.1

Aruna Gutti · Post by **Aruna Gutti** » Mon Dec 15, 2008 12:14 pm

I think it is a DataStage issue. I just got the same error which disappeared after I cleared the lock on one of the jobs in the sequence.

tonystark622 · Post by **tonystark622** » Mon Dec 15, 2008 12:31 pm

I am currently having the same problem.

I think I have a line on what's going on.

1) IBM sent me a patch for jobs that "deadlock". I can't find the ecase number right now. If I find it, I'll post it.

2) My UNIX admin folks found out that the system was rebooting for a weekly reboot, while my Job Sequencer job was running. Several hours later the Job Controller job gets "Job control process (pid xxxx) has failed." I moved the time my job executes to an earlier time and we didn't get the error this weekend.

Hope this helps,
Tony

tonystark622 · Post by **tonystark622** » Thu Jan 08, 2009 2:58 pm

This problem has been resolved.

The main job sequence started running at 3:00am and usually took a little over 1 hour to run.

Unknown to me, sometime around 4:00am on Monday the UNIX system was rebooted. This was a "normal" weekly reboot.
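The timing above can be checked with simple interval arithmetic. This is only a rough sketch: the 03:00 start and 04:00 reboot come from this post, while the 70-minute run time (standing in for "a little over 1 hour"), the 15-minute reboot length, and the function name `windows_overlap` are assumptions made for illustration.

```python
def windows_overlap(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    """Return True if half-open intervals [start_a, end_a) and [start_b, end_b) intersect.

    All values are minutes since midnight.
    """
    return start_a < end_b and start_b < end_a


# From the post: job starts at 03:00, reboot happens around 04:00.
# Assumed for illustration: a 70-minute run and a 15-minute reboot.
JOB_START = 3 * 60
JOB_END = JOB_START + 70
REBOOT_START = 4 * 60
REBOOT_END = REBOOT_START + 15

if windows_overlap(JOB_START, JOB_END, REBOOT_START, REBOOT_END):
    print("Job run window overlaps the reboot window")
else:
    print("Job run window is clear of the reboot window")
```

With these assumed numbers the job would still be running when the reboot starts, which matches what happened; starting the job a couple of hours earlier would clear the window.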

Apparently, some time later, the DataStage engine realized that the main job process was no longer running and aborted the job, logging the "Job control process (pid xxxx) has failed" message.
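For anyone who wants to check from outside DataStage whether the pid named in that message still exists, here is a minimal sketch using the standard POSIX "signal 0" probe. It is a generic technique, not the engine's own mechanism, and the function name `pid_is_alive` is made up for this example.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    try:
        # Signal 0 delivers nothing; the kernel only performs the
        # existence and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        # No such process.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


if __name__ == "__main__":
    # The current process is always alive, so this prints True.
    print(pid_is_alive(os.getpid()))
```

Keep in mind that pids are reused, especially after a reboot, so a True result only says that some process has that pid, not that it is the original job control process.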