
dsjob utility PID not terminating/ending in UNIX environment

Posted: Tue Dec 16, 2014 4:07 pm
by december7
Hi,

I am running a parallel job that listens to a message queue continuously.

The job is triggered through Autosys using the command below.

dsjob -run -paramfile <param file> -wait -warn 10 -jobstatus <Project Name> <Parallel Job Name>

When an "End of message" is pushed onto the queue (after the parallel job has run for about 12 hours), the parallel job shows as completed in Director, but the dsjob command is left running on the UNIX box.
As a result, the underlying Autosys job does not complete either.

The inactivity timeout on the server (in Administrator) is set to 4 hours.

I am looking for a way to make the dsjob command end when the parallel job completes in Director, so that the Autosys job ends properly.
Changing the inactivity timeout on the server (in Administrator) is ruled out, because the admin says it would impact all projects.

Thanks

Posted: Tue Dec 16, 2014 4:22 pm
by ray.wurlod
Try leaving out the -wait option.

The -jobstatus option will force dsjob not to return until the job finishes, so that its final exit status can be reported.
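
A minimal sketch of how a scheduler wrapper might consume that exit status. It assumes the dsjob exit code under -jobstatus follows the DataStage job status values (1 = finished OK, 2 = finished with warnings), so a scheduler that treats any non-zero code as failure needs a mapping step. The function name map_dsjob_status and the variables in the usage comment are hypothetical.

```shell
#!/bin/sh
# Map a `dsjob -jobstatus` exit code to a scheduler-friendly exit code.
# Assumption: the code mirrors the DataStage job status
# (1 = finished OK, 2 = finished with warnings); anything else is a failure.
map_dsjob_status() {
    case "$1" in
        1|2) echo 0 ;;   # job finished, with or without warnings
        *)   echo 1 ;;   # aborted, stopped, not runnable, or a dsjob error
    esac
}

# Hypothetical usage inside an Autosys wrapper script:
#   dsjob -run -paramfile "$PARAM_FILE" -warn 10 -jobstatus "$PROJECT" "$JOB"
#   exit "$(map_dsjob_status $?)"
```

This only helps once dsjob actually returns; it does not address a dsjob process that never exits.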

Posted: Wed Dec 17, 2014 12:51 pm
by december7
Hi,

I tried the command below:

dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>

Same behavior: even after the parallel job completes in Director, the dsjob PID is still active in UNIX.

Observation:
I see the PIDs below for roughly the first 7 hours after starting the job.

ps -ef | grep <DS Job name>
asys 15835 13763 0 05:31 ? 00:00:00 dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>
asys 15840 15837 0 05:31 ? 00:00:00 phantom DSD.RUN <DS Job name>. 0/10/0/0/0
asys 15899 15840 0 05:31 ? 00:00:00 phantom DSD.OshMonitor <DS Job name> 15890 MSEVENTS.FALSE
asys 16211 16210 0 05:31 ? 00:00:01 /bin/sh <PX Engine path>/PXEngine/grid_enabled.5.0.2/remote_wait.sh 15890 <Grid path>/grid_jobdir/asys/141217/<DS Job Name>/0531/<DSJob Name>15890 0 10

After around 8-9 hours, ps -ef returns only the following:

ps -ef | grep <DS Job name>
asys 15835 13763 0 05:31 ? 00:00:00 dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>
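
Note that in the first listing the phantom DSD.RUN process (15840) has parent 15837, not the dsjob PID 15835, so the engine processes are not direct children of dsjob. A small sketch for checking parent/child relationships from ps -ef output; the helper name children_of is hypothetical, and it assumes the standard ps -ef column order (UID, PID, PPID, ...).

```shell
#!/bin/sh
# Print the PIDs whose parent PID (column 3 of `ps -ef`) equals $1.
# Reads `ps -ef` style output on stdin.
children_of() {
    awk -v parent="$1" '$3 == parent { print $2 }'
}

# Example (not run here):
#   ps -ef | children_of 15840
```

An empty result for the dsjob PID alone does not show that the job has finished; it only shows that nothing was forked directly from that process.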

Thanks

Posted: Wed Dec 17, 2014 1:05 pm
by chulett
In your shoes I would involve support if you haven't done so already.

Posted: Wed Dec 17, 2014 1:06 pm
by PaulVL
Was process 15890 still active on the box?

Looks more like an issue with toolkit to me. I don't have 5.0 yet (actually getting it via shoe mail today from my IBM rep).

Posted: Wed Dec 17, 2014 2:44 pm
by december7
No Paul, process 15890 is gone after 8-9 hours.

Only PID 15835 is left.

Posted: Mon Dec 22, 2014 2:59 am
by boolseye
Please run the command below:
kill -9 PID
You need to own the process or have root access; otherwise, ask your admins to do it.
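
If the process does have to be killed, a gentler variant is to send SIGTERM first and fall back to SIGKILL only if the process is still there after a grace period. A minimal sketch; the function name stop_pid and the default grace period of 10 seconds are arbitrary choices for illustration.

```shell
#!/bin/sh
# Send SIGTERM, wait up to $2 seconds (default 10), then fall back to SIGKILL.
stop_pid() {
    pid=$1
    grace=${2:-10}
    # If the signal cannot be sent, the process is gone or not ours to signal.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null                # last resort, same as kill -9
    return 0
}

# Example (not run here):
#   stop_pid 15835 30
```

This is only a workaround for cleaning up; it does not explain why dsjob stays alive.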

Posted: Mon Dec 22, 2014 8:30 am
by chulett
I'm sure they're perfectly well aware of how to kill it, should they desire to do so. This conversation is more about the why of it, as in why it doesn't end on its own rather than needing to be killed.

December7, any luck with researching this? Any help from support?

Posted: Mon Jan 12, 2015 2:43 pm
by december7
No luck, chulett.
I opened a PMR with IBM; they are still looking into it.

Posted: Tue Jan 13, 2015 8:59 am
by PaulVL
I seem to recall someone from IBM telling me that the new handshaking for the Grid Enablement Toolkit was being done via named pipes that get created from the Compute Nodes back onto the Head Node via ssh. I wonder if that SSH connection timed out on you, and you are in a hung state. Can you see what the SSH timeout for inactivity is on your box?

Run a dummy job that sends an External Source stage to your compute node; a sleep 60 would be good. While the job is executing, look at the SSH connections on the head node and compute nodes for your user ID. That should give you a "normal pattern" to compare against. When you see your job hang, you can then check what state those SSH connections are in.
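
One way to capture that pattern is to summarize the SSH connections by TCP state while the dummy job runs, then again when the real job hangs. A rough sketch, assuming a Linux box where ss is available and sshd listens on the default port 22; the helper name ssh_conn_states is hypothetical. For the inactivity timeout itself, the sshd_config options ClientAliveInterval and ClientAliveCountMax are the usual place to look, though whether they apply here is a guess.

```shell
#!/bin/sh
# Count TCP connections involving port 22, grouped by state.
# Reads `ss -tan` style output on stdin:
#   State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
ssh_conn_states() {
    awk '$4 ~ /:22$/ || $5 ~ /:22$/ { n[$1]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Example (not run here), on the head node and on each compute node:
#   ss -tan | ssh_conn_states
```

Comparing the two snapshots should show whether connections have disappeared or are stuck in a state such as CLOSE-WAIT at the time of the hang.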

Do short-duration jobs work in your environment while longer-duration jobs fail or hang?

Does it only fail for 1 particular service ID or all?

Does it happen when the environment is busy or also when there is only 1 job running?

Can you reproduce in DEV?