dsjob utility PID not terminating/ending in UNIX environment

december7
Premium Member
Posts: 42
Joined: Thu May 19, 2005 6:38 pm

Post by december7 »

Hi,

I am running a parallel job that listens continuously to a message queue.

The job is triggered through Autosys using the command below.

dsjob -run -paramfile <param file> -wait -warn 10 -jobstatus <Project Name> <Parallel Job Name>

When an "End of message" marker is pushed onto the queue (after about 12 hours of run time), the parallel job shows as completed in Director, but the dsjob command is left running on the UNIX box.
As a result, the underlying Autosys job does not complete either.

The inactivity timeout on the server (in Administrator) is set to 4 hours.

I am looking for a way to make the dsjob command end when the parallel job completes in Director, so that the Autosys job ends properly.
Changing the inactivity timeout on the server (in Administrator) is ruled out, because our admin says it would impact all projects.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Try leaving out the -wait option.

The -jobstatus option will force dsjob not to return until the job finishes, so that its final exit status can be reported.
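
For a scheduler such as Autosys, the exit status that -jobstatus reports usually has to be translated, because it is derived from the job status rather than following the usual 0-means-success convention. A minimal sketch of such a wrapper function follows. It assumes that 1 means finished OK and 2 means finished with warnings, which should be checked against your DataStage version; the function name and the variables in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: turn the exit status of `dsjob ... -jobstatus` into a 0 / non-zero
# result that a scheduler understands.
# Assumption: 1 = finished OK, 2 = finished with warnings; anything else is
# treated as a failure. Verify these values for your DataStage release.
map_dsjob_status() {
    case "$1" in
        1) echo "OK";         return 0 ;;
        2) echo "WARNINGS";   return 0 ;;
        *) echo "FAILED($1)"; return 1 ;;
    esac
}

# Hypothetical usage, with the same placeholders as in the thread:
#   dsjob -run -paramfile "$PARAM_FILE" -warn 10 -jobstatus "$PROJECT" "$JOB"
#   map_dsjob_status $?
```

Whether warnings should count as success is a site decision; moving the `2)` branch into the failure case is a one-line change.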
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
december7
Premium Member
Posts: 42
Joined: Thu May 19, 2005 6:38 pm

Post by december7 »

Hi,

I tried the command below:

dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>

The behavior is the same: even after the parallel job completes in Director, the dsjob PID remains active on UNIX.

Observation:
I see the processes below for about 7 hours after starting the job.

ps -ef | grep <DS Job name>
asys 15835 13763 0 05:31 ? 00:00:00 dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>
asys 15840 15837 0 05:31 ? 00:00:00 phantom DSD.RUN <DS Job name>. 0/10/0/0/0
asys 15899 15840 0 05:31 ? 00:00:00 phantom DSD.OshMonitor <DS Job name> 15890 MSEVENTS.FALSE
asys 16211 16210 0 05:31 ? 00:00:01 /bin/sh <PX Engine path>/PXEngine/grid_enabled.5.0.2/remote_wait.sh 15890 <Grid path>/grid_jobdir/asys/141217/<DS Job Name>/0531/<DSJob Name>15890 0 10

After around 8-9 hours, ps -ef returns only this:

ps -ef | grep <DS Job name>
asys 15835 13763 0 05:31 ? 00:00:00 dsjob -run -paramfile <Param file> -warn 10 -jobstatus <project Name> <Job Name>

Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

In your shoes I would involve support if you haven't done so already.
-craig

"You can never have too many knives" -- Logan Nine Fingers
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Was process 15890 still active on the box?

Looks more like an issue with the toolkit to me. I don't have 5.0 yet (actually getting it via shoe mail today from my IBM rep).
december7
Premium Member
Posts: 42
Joined: Thu May 19, 2005 6:38 pm

Post by december7 »

No, Paul; process 15890 is gone after 8-9 hours.

Only PID 15835 is left.
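
When a single process is left behind like this, it can help to record its state and its open file descriptors before anyone kills it, since a pipe or socket that is still open hints at what it is waiting on. The following is a rough sketch, assuming a Linux box where /proc is available; the function name is made up for illustration, and 15835 is simply the PID from the listing above.

```shell
#!/bin/sh
# Sketch: summarise one process and list its open file descriptors.
# Assumes Linux (/proc). Reading another user's fd list needs the same
# user id as the process, or root.
inspect_pid() {
    pid=$1
    # PID, parent PID, state, elapsed time and command line.
    ps -o pid= -o ppid= -o stat= -o etime= -o args= -p "$pid" || return 1
    # Open descriptors: pipes or sockets here suggest what it is blocked on.
    if [ -d "/proc/$pid/fd" ]; then
        ls -l "/proc/$pid/fd" 2>/dev/null
    fi
    return 0
}

# Example, using the PID from the thread:
#   inspect_pid 15835
```

Output captured this way is also something concrete to attach to a support case.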
boolseye
Participant
Posts: 18
Joined: Mon Jul 15, 2013 4:01 am

Post by boolseye »

Please run the command below:
kill -9 PID
You need root access, or you can ask your admins to do it.
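
If the leftover process does have to be removed by hand, a gentler variant is to send SIGTERM first and escalate to SIGKILL only when the process ignores it. This is a generic shell sketch, not anything specific to dsjob; the function name and the default grace period are arbitrary. The owner of a process can normally signal it as well, so root is not always required.

```shell
#!/bin/sh
# Sketch: ask a process to exit with SIGTERM, then escalate to SIGKILL if it
# is still present after a grace period (in seconds, default 10).
stop_pid() {
    pid=$1
    grace=${2:-10}
    # If TERM cannot be delivered, the process is gone or not ours to signal.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after TERM
        sleep 1
        grace=$((grace - 1))
    done
    # Still there after the grace period: force it.
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Example, using the PID from the thread with a 30 second grace period:
#   stop_pid 15835 30
```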
-----------------
Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm sure they're perfectly well aware of how to kill it, should they desire to do so. This conversation is more about the why of it, as in why it doesn't end on its own rather than needing to be killed.

December7, any luck with researching this? Any help from support?
december7
Premium Member
Posts: 42
Joined: Thu May 19, 2005 6:38 pm

Post by december7 »

No luck, chulett.
I opened a PMR with IBM; they are still looking into it.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

I seem to recall someone from IBM telling me that the new handshaking for the Grid Enablement Toolkit was being done via named pipes that get created from the Compute Nodes back onto the Head Node via ssh. I wonder if that SSH connection timed out on you, and you are in a hung state. Can you see what the SSH timeout for inactivity is on your box?

Run a dummy job with an External Source stage that executes on your compute node; sleep 60 would be a good command for it. While the job is running, look at the SSH connections on the head node and the compute nodes for your user ID. That should give you a "normal pattern" to compare against. When you see your job hang, you can then check what state those SSH connections are in.
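
One way to capture that "normal pattern" is to snapshot the ssh and sshd processes owned by the service ID, once during a healthy run and again during a hang. The sketch below relies only on standard ps and awk; the function names are invented for illustration, and the filter is a heuristic on the command name.

```shell
#!/bin/sh
# Sketch: list ssh / sshd processes for one user, with elapsed time.
filter_ssh() {
    # Reads "pid etime args..." lines and keeps those whose command is ssh,
    # sshd, or a session process titled "sshd:", with or without a path.
    awk '$3 ~ /(^|\/)sshd?:?$/'
}

ssh_snapshot() {
    user=${1:-$(id -un)}
    ps -u "$user" -o pid= -o etime= -o args= | filter_ssh
}

# Example: save a baseline on the head node, then compare during a hang.
#   ssh_snapshot asys > /tmp/ssh_baseline.txt
#   ssh_snapshot asys | diff /tmp/ssh_baseline.txt -
```

For the inactivity timeout itself, settings such as ClientAliveInterval and TCPKeepAlive in the sshd configuration (commonly /etc/ssh/sshd_config, though the path may differ on your box) are the usual places to look; firewalls between the nodes can also drop idle connections.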

Do short-duration jobs work in your environment while longer-duration jobs fail or hang?

Does it only fail for 1 particular service ID or all?

Does it happen when the environment is busy or also when there is only 1 job running?

Can you reproduce in DEV?