dsjob utility PID not terminating/ending in UNIX environment
Hi,
I am running a parallel job that listens to a message queue continuously.
This job is triggered through Autosys using the command below:
dsjob -run -paramfile <param file> -wait -warn 10 -jobstatus <Project Name> <Parallel Job Name>
When an "End of message" is pushed onto the queue (after roughly 12 hours of parallel job run time), the parallel job shows as completed in Director, but the dsjob command is left running on the UNIX box.
As a result, the underlying Autosys job never completes either.
The inactivity timeout on the server (in Administrator) is set to 4 hours.
I am looking for a way to end the dsjob command as soon as the parallel job completes in Director, so that the Autosys job can finish properly.
Changing the "Inactivity timeout" on the server (in Administrator) is ruled out, as the admin says it would impact all projects.
Thanks
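One workaround worth trying, sketched below: drop -wait and -jobstatus from the dsjob call and let the wrapper script poll the job status itself, so the wrapper (and therefore Autosys) decides when to exit instead of the blocked dsjob client. This is a minimal sketch, assuming `dsjob -jobinfo` is available on the engine tier and that the usual DataStage status codes apply; the project and job names are placeholders.

```shell
#!/bin/sh
# Polling wrapper sketch (assumes `dsjob -jobinfo` is on PATH).
# Assumed status-code convention:
#   0 = RUNNING, 1 = RUN OK, 2 = RUN with WARNINGS, 3 = RUN FAILED.

# parse_status: pull the numeric code out of a line like
#   "Job Status      : RUN OK (1)"
parse_status() {
    awk -F'[()]' '/Job Status/ {print $2; exit}'
}

# wait_for_job: poll until the job leaves RUNNING, then map the final
# status to an exit code Autosys can evaluate.
wait_for_job() {
    project=$1 job=$2
    while :; do
        code=$(dsjob -jobinfo "$project" "$job" | parse_status)
        [ "$code" != "0" ] && break
        sleep 60    # poll interval; tune as needed
    done
    case "$code" in
        1|2) return 0 ;;   # finished OK or with warnings
        *)   return 1 ;;   # failed, aborted, or unknown
    esac
}

# Example invocation (commented out; needs a live DataStage engine):
# dsjob -run -paramfile params.txt -warn 10 "<Project Name>" "<Parallel Job Name>"
# wait_for_job "<Project Name>" "<Parallel Job Name>"
```

Because the wrapper exits on its own schedule, a hung dsjob client can no longer hold the Autosys job open.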
Hi,
I tried the command below:
dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>
Same behavior: even after the parallel job completes in Director, the dsjob PID remains active in UNIX.
Observation:
I see the PIDs below for around 7 hours after starting the job.
ps -ef | grep <DS Job name>
asys 15835 13763 0 05:31 ? 00:00:00 dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>
asys 15840 15837 0 05:31 ? 00:00:00 phantom DSD.RUN <DS Job name>. 0/10/0/0/0
asys 15899 15840 0 05:31 ? 00:00:00 phantom DSD.OshMonitor <DS Job name> 15890 MSEVENTS.FALSE
asys 16211 16210 0 05:31 ? 00:00:01 /bin/sh <PX Engine path>/PXEngine/grid_enabled.5.0.2/remote_wait.sh 15890 <Grid path>/grid_jobdir/asys/141217/<DS Job Name>/0531/<DSJob Name>15890 0 10
After around 8-9 hours, ps -ef returns only the following:
ps -ef | grep <DS Job name>
asys 15835 13763 0 05:31 ? 00:00:00 dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>
Thanks
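Before killing that lingering dsjob PID, it can help to see what it is actually blocked on. A minimal inspection sketch, assuming a Linux box with /proc (15835 stands in for the lingering dsjob PID from the ps -ef output above):

```shell
#!/bin/sh
# inspect_pid: show what a (possibly hung) process is doing.
# In the STAT column, S means sleeping; WCHAN names the kernel wait
# the process is parked in (e.g. a read on a pipe or socket).
inspect_pid() {
    pid=$1
    ps -o pid,ppid,stat,wchan,args -p "$pid"
    # Open descriptors: a pipe or socket listed here often points at
    # what the process is still waiting to read (Linux /proc assumed).
    ls -l "/proc/$pid/fd" 2>/dev/null || true
}

# Example (replace 15835 with the lingering dsjob PID):
# inspect_pid 15835
```

If dsjob shows as sleeping on a pipe or socket read after its children have exited, that supports the theory that it is waiting on a handshake that never arrives.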
I'm sure they're perfectly well aware of how to kill it, should they desire to do so. This conversation is more about the why of it, as in why it doesn't end on its own rather than needing to be killed.
December7, any luck with researching this? Any help from support?
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
I seem to recall someone from IBM telling me that the new handshaking for the Grid Enablement Toolkit was being done via named pipes that get created from the Compute Nodes back onto the Head Node via ssh. I wonder if that SSH connection timed out on you, and you are in a hung state. Can you see what the SSH timeout for inactivity is on your box?
Run a dummy job that sends an External Source stage to your compute node; a sleep 60 would be good. During the execution of the job, look at the SSH connections on the head node and compute nodes for your user ID. That should give you a "normal pattern" to compare against. Then, when you see your job hang, check what state those ssh connections are in.
Do short-duration jobs work in your environment while longer-duration jobs fail/hang?
Does it only fail for 1 particular service ID or all?
Does it happen when the environment is busy or also when there is only 1 job running?
Can you reproduce in DEV?
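The SSH check suggested above can be sketched as a small snapshot script, taken once during a healthy run and once when the hang appears, then diffed. This assumes a Linux head node; the asys user id comes from the ps -ef output earlier, and the sshd_config path is an assumption for your box.

```shell
#!/bin/sh
# snapshot_ssh: list ssh processes and established port-22 connections
# for one user, plus the sshd idle-timeout settings that could drop an
# inactive tunnel between head node and compute nodes.
snapshot_ssh() {
    user=$1
    echo "== ssh processes for $user =="
    ps -fu "$user" 2>/dev/null | grep '[s]sh' || echo "(none)"
    echo "== established port-22 connections =="
    ss -tn 2>/dev/null | grep ':22' || echo "(none)"
    echo "== sshd keepalive/idle settings =="
    grep -Ei 'ClientAlive|TCPKeepAlive' /etc/ssh/sshd_config 2>/dev/null \
        || echo "(defaults in effect)"
}

# Usage: capture a baseline while the sleep-60 dummy job runs, capture
# again when a job hangs, then compare:
# snapshot_ssh asys > healthy.txt
# snapshot_ssh asys > hung.txt
# diff healthy.txt hung.txt
```

A connection present in the healthy snapshot but missing (or stuck) in the hung one would point at an SSH inactivity timeout dropping the named-pipe handshake back to the head node.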