Jobs are hanging can't kill PID

ggarze
Premium Member
Posts: 78
Joined: Tue Oct 11, 2005 9:37 am

Post by ggarze »

On two of the last three days, a job has failed to complete and just hangs, occupying all the CPU while its I/O reads climb past a million. Any job started after that begins to hang as well. I try killing uvsh.exe through Task Manager, but nothing happens, and when I try again it says "Access Denied". Eventually there are a bunch of uvsh.exe processes out there and nothing is completing. Is there another way to kill these PIDs in Windows? Should I be doing it this way?

Thanks,
Glenn
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Rather than hanging could it be writing out its write-delayed hashed files?

As for killing/stopping jobs, the Director STOP button is the first and best recommendation. A Task Manager kill is usually the second-to-last thing to do, right before a server reboot.

Try the STOP button, then try Cleanup Resources in DS Director. Figure out why you think the job is hanging. The other jobs "hanging" could just be the result of a fully utilized server.

Is this a single core server? Are all cores being slammed?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ggarze
Premium Member
Posts: 78
Joined: Tue Oct 11, 2005 9:37 am

Post by ggarze »

First off thanks for the reply...

The CPUs are dual core on one blade. I'm not sure what you mean by "write-delayed hashed files". Could you explain, or let me know if there is already a post out there with the info?

I definitely agree the other jobs hang because the one that is clocking away is just killing the box. Eventually, when there are enough uvsh.exe processes running, any new sequence that calls a server job aborts with a timeout error while trying to start the server job. It even gets to the point where the project gets locked and we can't log in to DS.

We have tried the STOP button, and I'm not sure if we just don't give it enough time, but it never really seems to work quickly enough. It's a production server and we are in a crunch to get things rolling ASAP. Unfortunately, after "End Task" doesn't work, it's reboot and rerun the jobs. I even tried clearing locks in Admin, but that didn't seem to help either.

Just to give you a little more info: our sequencers have a function that runs at the end of every job. It writes logs out to a shared directory, reads the logs for aborts and warnings, sends emails to the appropriate parties, and does some job resetting if needed. Inside the function we also execute another sequencer that logs information to a database. Those jobs are all multi-instance, with the instance set by the job that calls them. That would be my next place to look, maybe by putting in some DSLogInfo() calls to see which parts of the post process have executed. This post process has been running for years without issue, but since the hanging started we have noticed that it must be happening in the post process, because all the server jobs within the sequencer show as finished yet the sequencer will not end.

So, besides the ways to stop a job in DS and the Windows Task Manager, are you aware of any DOS command that might be able to kill a process? Oh yeah, I tried stopping and starting the services too, but the uvsh.exe processes were still out there.

Thanks,
Glenn
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Your post sheds more light. Once the jobs are done, the STOP button doesn't really work anymore while third-party external processes are running. So if you're making Routine calls to go out and run shell scripts, connect to databases via the command line, etc., you're in a blackout zone. Until those commands return, your job is waiting on these synchronous calls.
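To illustrate the blocking behaviour described above outside of DataStage, here is a minimal Python sketch of a synchronous external call that is given a time limit, so the caller gets control back even if the command never returns. The function name `run_with_timeout` and the idea of wrapping the post-process commands this way are assumptions for illustration; nothing in the thread says the routines are, or can be, structured like this.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run an external command and wait for it, but only for a limited time.

    Returns the exit code, or None if the time limit was reached.
    subprocess.run kills the direct child process when the timeout expires;
    anything that child started itself may still need separate cleanup.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

A return value of None is then the signal to log the stuck command and move on instead of waiting indefinitely.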

That wouldn't explain the hammering of the CPU and disks, which sounds like your external command processes are off doing things. You absolutely need verbose logging in your stuff to make it as debuggable as possible.
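As a rough picture of what that kind of logging could look like, the Python sketch below writes a line before and after each step of a post process, so the last START without a matching END shows where it stopped. The step names in the usage note are hypothetical, loosely based on the post-process description earlier in the thread; in a DataStage routine the same idea would be expressed with DSLogInfo() calls.

```python
import logging

log = logging.getLogger("post_process")


def run_steps(steps):
    """Run (name, action) pairs in order, logging before and after each one.

    Returns the names of the steps that completed. If an action raises,
    the failure is logged with its traceback and the exception is re-raised.
    """
    completed = []
    for name, action in steps:
        log.info("START %s", name)
        try:
            action()
        except Exception:
            log.exception("FAILED %s", name)
            raise
        log.info("END %s", name)
        completed.append(name)
    return completed
```

For example, `run_steps([("write_logs", write_logs), ("send_emails", send_emails)])` would log four lines on a clean run, where `write_logs` and `send_emails` stand in for whatever the real steps are.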

Hashed files have write-delay caching capabilities to speed up processing, so if your job wrote a lot of rows to a cache you "hit the wall" at the end of the job when the flush occurs. This doesn't sound like your situation.

Windoze sucks, and killing processes destabilizes a flaky OS even more (yes, I have my opinions). I don't know of anything helpful; someone else can chime in. The three-finger-salute reboot (that would be Ctrl-Alt-Del to the noobs out there) works best after enough processes are killed.

I would look to find the called process (not the DS job) that's doing the work and kill it.
Kenneth Bland
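On the earlier question about a command-line way to find and kill processes: Windows ships `tasklist` and `taskkill`, and the Python sketch below wraps them. It lists PIDs by image name and builds a forced kill command for each one. The helper names are assumptions for illustration, and a forced kill can still be refused with "Access Denied" if the prompt does not have sufficient rights over the process, so this is not a guaranteed fix for the situation in this thread.

```python
import csv
import io
import subprocess

IMAGE_NAME = "uvsh.exe"  # process name taken from the thread


def parse_tasklist_csv(output):
    """Return the PIDs found in the output of `tasklist /FO CSV /NH`."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # Data rows start with the image name followed by the PID;
        # informational lines (e.g. "no tasks match") are skipped.
        if len(row) >= 2 and row[1].strip().isdigit():
            pids.append(int(row[1]))
    return pids


def list_pids(image_name=IMAGE_NAME):
    """Ask Windows for every running process with the given image name."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist_csv(result.stdout)


def kill_command(pid):
    """Build a taskkill command: /F forces termination, /T includes child processes."""
    return ["taskkill", "/PID", str(pid), "/T", "/F"]
```

Calling `list_pids()` on the server and printing `kill_command(pid)` for each result gives a list of commands to review before running any of them. As suggested above, the more useful target may be the external process the job is waiting on, not uvsh.exe itself.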
ggarze
Premium Member
Posts: 78
Joined: Tue Oct 11, 2005 9:37 am

Post by ggarze »

Okay, here's an update, as it happened again. It is actually hanging while calling an ETL sequencer in my post process. This ETL job runs under multiple instances, and when I went into Director all the hanging uvsh.exe processes seemed to be related to this job and its instances. This sequencer calls another sequencer. It looks as though the called sequencer was entered but never called the server job inside it. The log entry that it hangs on is

"SEQETL0005.SEQCDS0002.JobControl (@Coordinator): Starting new run of checkpointed Sequence job"

This appears in every hanging job. (The instance is different of course)

Anyway, the only thing I can think of when I saw this entry is that, for some reason, with these jobs we decided to let DataStage handle the aborts and cleanups by checking those boxes in the job properties, on the 'General' tab under 'Compilation Options'. Before whatever release this came out in, we always handled resets and restarts by making developers use our sequence templates, which included our post processes. Maybe we saw a new option and figured we'd try it for this utility job, who knows. To me it now seems possible that the hang is caused by DataStage trying to figure out this 'checkpoint' thing. Maybe something is now corrupt in a system file for this job. Anyway, we are going to clear all those check boxes on the job and see how it goes.

On the other hand, do you think that too many processes calling this post process at the same time could be causing the issue, despite each having its own instance?

Thanks,
Glenn