PID Failed


ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Post by ashik_punar »

Hi Everyone,

I am facing a problem with the PID of a job. I am trying to run a sequencer that contains nearly 25 jobs. The first job in the sequencer is job1, which runs first with no other job running in parallel. After it completes, I try to run job2 and job3 in parallel. For job2 I get the PID failed error. The log for the sequencer looks like this:

Control Starting Job Job2 (...)
Warning Job control process (pid 112431) has failed
Control Job Job2.aborted


To get a solution from IBM support, we sent them a description of this error, and they are asking us to:

clear the &PH& folder.

We are not able to find this folder, and we also do not know why we are getting this PID failed error or what the possible solution might be.
If anyone has any information about the folder or this problem, please help.

Thanks in advance,
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

The &PH& directory will be present in your project directory. It holds run-time information (not all of it, but some). Go ahead and clear that out.
What error messages are you getting in those two particular jobs?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Post by ashik_punar »

Hi,

Thanks a lot for the quick reply.

I was able to find this directory, but I am not its owner. Is there any particular way to clean this directory? Sorry for asking something like this, but I do not have much information about this directory.

The jobs that abort do not give me any error other than the PID failed warning; then the job aborts. I am running the jobs on 4 nodes.


Thanks again,
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

How are you clearing &PH&? You need to go to TCL and issue the CLEAR.FILE &PH& command from within your project. Make sure no jobs are running at that time.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Actually, the &PH& project subdirectory can be cleared from UNIX, even when jobs are running.
ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Post by ashik_punar »

Hi DSguru/ArndW,

I don't have any idea how to get to TCL; I am really very sorry about that. Can you please guide me on this?

ArndW,

You said that we can also clear this subdirectory from UNIX, even when jobs are running. Please guide me on this as well.

I am really very thankful for all the input you are giving.
Thanks a lot for helping me out.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Open your Administrator client. Select your project. Click on the Command button. This opens a window in which you can enter "TCL" commands. Enter the command CLEAR.FILE &PH& and await a response. Close the command window.

PID in this context means "process ID" - this is not the cause of the problem. The problem is that Job2 aborted, and the job control process (job sequence?), which was executing Job2 with process ID 112431, is reporting that fact to you.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

You can use your UNIX "rm" command to remove files in that directory. Nothing untoward will happen if you delete the open file for a running job, except that any information about the running job will be lost. I would use a filter on that directory that removes only files older than a day, to be sure.
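
As a rough sketch of the "older than a day" filter, here is one way it could be done in Python rather than with "rm". The function name remove_old_files and its parameters are made up for this illustration; it is not a DataStage utility, and the directory path has to be supplied by whoever runs it.

```python
import time
from pathlib import Path

ONE_DAY_SECONDS = 24 * 60 * 60


def remove_old_files(directory, max_age_seconds=ONE_DAY_SECONDS, now=None):
    """Delete regular files in `directory` whose last modification is older
    than `max_age_seconds`. Subdirectories are left alone.

    Returns the sorted names of the files that were removed.
    (Hypothetical helper for illustration only.)
    """
    now = time.time() if now is None else now
    removed = []
    for entry in Path(directory).iterdir():
        # Skip anything that is not a plain file, and anything recent enough
        # that it might belong to a job that is still running.
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

It would be called with the full path to the &PH& subdirectory of the project, using an account that has write permission on that directory. Since the ampersands are inside a Python string, they need no shell quoting.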
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Do you have to kill those jobs, or do they abort by themselves? Reset the job and see if any additional messages pop up.
ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Post by ashik_punar »

Hi Ray/ArndW/DSguru,

Thanks to you all for all the help that you have been extending.

The problem is that this PID failed error does not occur in just one job. As I wrote, I have 25 jobs in the sequencer; sometimes the error comes in the first job, sometimes in the 5th job, and sometimes in some other job. In other words, it can occur in any of the 25 jobs. The last time I ran the sequencer it gave me the same problem, and the log for the sequencer looks like this:

Occurred: 2:21:32 PM On date: 12/5/2006 Type: Control
Event: Starting Job Seq1. (...)

Occurred: 2:21:32 PM On date: 12/5/2006 Type: Info
Event: Environment variable settings: (...)

Occurred: 2:21:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@Coordinator): Starting new run of checkpointed Sequence job

Occurred: 2:21:33 PM On date: 12/5/2006 Type: RunJob
Event: Seq1 -> (Sybase_FINCAFL_Load_Job): Job run requested (...)

Occurred: 2:21:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSRunJob): Waiting for job Sybase_FINCAFL_Load_Job to start

Occurred: 2:21:34 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSWaitForJob): Waiting for job Sybase_FINCAFL_Load_Job to finish

Occurred: 2:29:29 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSWaitForJob): Job Sybase_FINCAFL_Load_Job has finished, status = 1 (Finished OK)

Occurred: 2:29:30 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@Sybase_FINCAFL_Load_Job): Report on job: Sybase_FINCAFL_Load_Job (...)

Occurred: 2:29:30 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@Sybase_FINCAFL_Load_Job): Checkpointed run of job 'Sybase_FINCAFL_Load_Job'

Occurred: 2:29:30 PM On date: 12/5/2006 Type: RunJob
Event: Seq1 -> (Sybase_FINHDR_Load_job): Job run requested (...)

Occurred: 2:29:30 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSRunJob): Waiting for job Sybase_FINHDR_Load_job to start

Occurred: 2:29:31 PM On date: 12/5/2006 Type: RunJob
Event: Seq1 -> (Sybase_FINASST_Load_Job): Job run requested (...)

Occurred: 2:29:31 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSRunJob): Waiting for job Sybase_FINASST_Load_Job to start

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSWaitForJob): Waiting for job Sybase_FINHDR_Load_job+Sybase_FINASST_Load_Job to finish

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSWaitForJob): Job Sybase_FINHDR_Load_job has finished, status = 3 (Aborted)

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Warning
Event: Seq1..JobControl (@Sybase_FINHDR_Load_job): Job Sybase_FINHDR_Load_job did not finish OK, status = 'Aborted'

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@Sybase_FINHDR_Load_job): Report on job: Sybase_FINHDR_Load_job (...)

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Warning
Event: Seq1..JobControl (@Sybase_FINHDR_Load_job): Controller problem: Unhandled abort encountered in job Sybase_FINHDR_Load_job

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@Sybase_FINHDR_Load_job): Will execute error activity: FailCase_EA

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSSendMail): Sent message to 'punardeeps@hcl.in,ashikm@hcl.in,amalarpova@hcl.in' (...)

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@ExceptionMail_NA): Omitted checkpoint for call of routine 'DSSendMail'

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSWaitForJob): Waiting for job Sybase_FINASST_Load_Job to finish

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (DSWaitForJob): Job Sybase_FINASST_Load_Job has finished, status = 3 (Aborted)

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Warning
Event: Seq1..JobControl (@Sybase_FINASST_Load_Job): Job Sybase_FINASST_Load_Job did not finish OK, status = 'Aborted'

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@Sybase_FINASST_Load_Job): Report on job: Sybase_FINASST_Load_Job (...)

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Info
Event: Seq1..JobControl (@Coordinator): Summary of sequence run (...)

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Fatal
Event: Seq1..JobControl (fatal error from @Coordinator): Sequence job (restartable) will abort due to previous unrecoverable errors

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Warning
Event: Attempting to Cleanup after ABORT raised in stage Seq1..JobControl

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Control
Event: Job Seq1 aborted.

End of report.

So two jobs got the PID failed error. The log for the first job (Sybase_FINHDR_Load_job) looks like this:

Occurred: 2:29:30 PM On date: 12/5/2006 Type: Control
Event: Starting Job Sybase_FINHDR_Load_job. (...)

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Warning
Event: Job control process (pid 164150) has failed

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Control
Event: Job Sybase_FINHDR_Load_job. aborted


and the log for the second job (Sybase_FINASST_Load_Job) looks like this:

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Control
Event: Starting Job Sybase_FINASST_Load_Job. (...)

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Warning
Event: Job control process (pid 1781864) has failed

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Control
Event: Job Sybase_FINASST_Load_Job. aborted

End of report.

I am sorry for such a long post, but I think all this description helps explain my problem. The main issue is that I am getting the PID failed error randomly: one time it is in one job, and the next time it is in some other job. Can you please tell me what the possible reason could be? I have not been able to solve this for quite some time. I believe that with your help I will be able to get through it.

Thanks a lot for all your help.
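
Since the failure moves from job to job, it may help to pull the failed PIDs out of exported logs mechanically and compare them across runs. The following is only a sketch, assuming the log has been saved as text in the "Occurred / Event" layout shown above; find_failed_pids is a hypothetical helper, not part of DataStage.

```python
import re

# Patterns based on the log lines quoted in this thread.
STARTING_JOB = re.compile(r"Starting Job (\S+)")
PID_FAILED = re.compile(r"Job control process \(pid (\d+)\) has failed")


def find_failed_pids(log_text):
    """Return a list of (job_name, pid) pairs, one for each
    'Job control process (pid N) has failed' line, paired with the most
    recent 'Starting Job <name>' line that precedes it (or None)."""
    current_job = None
    failures = []
    for line in log_text.splitlines():
        started = STARTING_JOB.search(line)
        if started:
            # The log prints the name with a trailing period; drop it.
            current_job = started.group(1).rstrip(".")
            continue
        failed = PID_FAILED.search(line)
        if failed:
            failures.append((current_job, int(failed.group(1))))
    return failures


SAMPLE_LOG = """\
Event: Starting Job Sybase_FINHDR_Load_job. (...)
Event: Job control process (pid 164150) has failed
Event: Job Sybase_FINHDR_Load_job. aborted
Event: Starting Job Sybase_FINASST_Load_Job. (...)
Event: Job control process (pid 1781864) has failed
Event: Job Sybase_FINASST_Load_Job. aborted
"""

print(find_failed_pids(SAMPLE_LOG))
# -> [('Sybase_FINHDR_Load_job', 164150), ('Sybase_FINASST_Load_Job', 1781864)]
```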
ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Post by ashik_punar »

I am not doing anything to the jobs. They get this warning in their log and abort. I have reset them and tried running them again. Sometimes they run, and sometimes they give this error again. I don't get any other messages when resetting.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What warning and error messages appear in the job logs of the jobs that aborted, such as Sybase_FINHDR_Load_job or Sybase_FINASST_Load_Job ?
ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Post by ashik_punar »

Hi Ray,

For the job Sybase_FINHDR_Load_job, I am getting the following log. This is the full log for the job run:

Occurred: 2:29:30 PM On date: 12/5/2006 Type: Control
Event: Starting Job Sybase_FINHDR_Load_job. (...)

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Warning
Event: Job control process (pid 164150) has failed

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Control
Event: Job Sybase_FINHDR_Load_job. aborted
-----------------------------------------------------------
For the job Sybase_FINASST_Load_Job, I am getting the following log. This is the full log for the job run:

Occurred: 2:29:32 PM On date: 12/5/2006 Type: Control
Event: Starting Job Sybase_FINASST_Load_Job. (...)

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Warning
Event: Job control process (pid 1781864) has failed

Occurred: 2:29:33 PM On date: 12/5/2006 Type: Control
Event: Job Sybase_FINASST_Load_Job. aborted
-----------------------------------------------------------

These are the logs for the 2 jobs which got aborted.

I hope I was able to give you the information which you asked for.

Thanks a lot for all your inputs.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Not much to go on there. Are Sybase_FINHDR_Load_job and Sybase_FINASST_Load_Job jobs or job sequences? In either case, please set APT_PM_SHOW_PIDS to True before executing again - that way you will be able to work out which process was executing what.
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India

Post by thebird »

Hi All,

I came across this some time back, and the cause we found was weird.

In my case, I had a sequence that called about 5 jobs. When the sequence was run, it would sometimes abort with a PID failure error and sometimes run fine.

What we found was that the environment variable $APT_DUMP_SCORE was set to TRUE in the job, whereas at the project level it was set to FALSE.

When the value was either set to TRUE at the project level or set to FALSE in the job, the sequence would run fine, and it has been running fine since (with the value set to FALSE in the job). They had opened a case with IBM regarding this, but I am not sure what became of it later on.

Maybe it is worth checking whether there are any such environment variables in the job.
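
If there are many jobs to check, the comparison itself can be done mechanically once the values have been noted down. This is only a sketch: the two dictionaries have to be filled in by hand from the Administrator and the job properties, and find_overrides is a made-up helper, not something DataStage provides.

```python
def find_overrides(project_env, job_env):
    """Return {name: (project_value, job_value)} for every variable whose
    job-level value differs from the project-level value.

    A variable that is set in the job but absent at project level is
    reported with None as its project value.
    (Hypothetical helper; both inputs are plain dicts entered by hand.)
    """
    return {
        name: (project_env.get(name), job_value)
        for name, job_value in job_env.items()
        if project_env.get(name) != job_value
    }


# Values as described in this post: FALSE at project level, TRUE in the job.
project_settings = {"APT_DUMP_SCORE": "FALSE"}
job_settings = {"APT_DUMP_SCORE": "TRUE"}

print(find_overrides(project_settings, job_settings))
# -> {'APT_DUMP_SCORE': ('FALSE', 'TRUE')}
```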

Aneesh