Job control process (pid 1084) has failed

rmcclure
Participant
Posts: 48
Joined: Fri Dec 01, 2006 7:50 am

Post by rmcclure »

Hi,

I am having a very frustrating problem:

We have a sequence job that runs various server jobs and other sequence jobs. This job is set up to give an email notification if it fails.
Sometimes the job neither fails nor completes.
For example:
mainSequencejob runs various server jobs, then runs sequencejob1, which runs serverjob1 and serverjob2. Then mainSequencejob moves on to other server and sequence jobs.
Both serverjob1 and serverjob2 complete successfully and sequencejob1 completes successfully, but mainSequencejob has a warning: "Job control process (pid 1084) has failed"
The frustrating part is that this happens sometime during the night, but there is no email notification. As soon as someone logs into DataStage Director and goes to view the logs, the warning appears and the email is sent. Often the job will then continue, so I get a sequence job with a status of "aborted" while the job is still running. It is almost as if the whole ETL job is in limbo until someone logs in.
We also can't reproduce it. It will happen one day and not the next.


I'm taking a wild guess that our AS/400 is dropping the process and DataStage is not being informed. Since company policy does not allow me to look at the production server, and the sys-admins say "no, nothing happened last night", I can only guess.
I don't understand why DataStage seems to sit and wait if the process ID has been dropped.
Do you think this is a DataStage issue or an AS/400 issue?

Stats:
The source DB is AS/400 DB2
The target DW is SQL Server 2005
We connect using ODBC
The DataStage version is 7.5.1
Aruna Gutti
Premium Member
Posts: 145
Joined: Fri Sep 21, 2007 9:35 am
Location: Boston

Post by Aruna Gutti »

I think it is a DataStage issue. I just got the same error which disappeared after I cleared the lock on one of the jobs in the sequence.
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

I am currently having the same problem.

I think I have a line on what's going on.

1) IBM sent me a patch for jobs that "deadlock". I can't find the ecase number right now. If I find it, I'll post it.

2) My UNIX admin folks found out that the system was rebooting for a weekly reboot, while my Job Sequencer job was running. Several hours later the Job Controller job gets "Job control process (pid xxxx) has failed." I moved the time my job executes to an earlier time and we didn't get the error this weekend.

Hope this helps,
Tony
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

This problem has been resolved.

The main job sequence started running at 3:00am and usually took a little over 1 hour to run.

Unknown to me, sometime around 4:00am on Monday the UNIX system was rebooted. This was a "normal" weekly reboot.

Apparently, some time later, the DataStage engine realized that the main job process was no longer running and aborted the job, logging the "Job control process (pid xxxx) has failed" message.
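
For anyone who wants to catch this situation sooner than the engine does, here is a minimal sketch of the general idea of checking whether a recorded process ID still exists on a UNIX host. It is not how DataStage itself does the check, and it does not use any DataStage API. The names `pid_is_alive` and `find_dead_controllers`, and the job-name-to-pid mapping, are hypothetical and only for illustration.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 sends nothing; the OS only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def find_dead_controllers(recorded: dict[str, int]) -> list[str]:
    """Return the (hypothetical) job names whose recorded pid no longer exists."""
    return sorted(name for name, pid in recorded.items() if not pid_is_alive(pid))


if __name__ == "__main__":
    # Assumed input: a mapping of job name to the pid noted when the job started.
    recorded = {"mainSequencejob": os.getpid()}
    print(find_dead_controllers(recorded))
```

One caveat: after a reboot the same pid number can be handed to an unrelated process, so "alive" here only means that some process holds that pid. A real check would also compare the process start time or command line.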