defunct and ghost jobs

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS BASIC, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
Paul Preston
Participant
Posts: 24
Joined: Wed Apr 02, 2003 7:09 am
Location: United Kingdom

defunct and ghost jobs

Post by Paul Preston »

We have two instances of a DataStage job running simultaneously on a Sun Solaris server running DataStage 6.01. When one job ends, another is started, provided no more than two would then be running, so there are never more than two jobs running at once. All these jobs finish with no errors or warnings and do what we expect them to do.

However, we find that running ps -ef a few minutes later (by which time several jobs will have completed) shows hundreds of defunct processes and old job instances that have actually finished.
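For what it's worth, a quick way to see how many zombies there are and which parents are holding them (a sketch, assuming a ps whose third column is the parent PID, as with ps -ef) is:

```shell
# Group the <defunct> (zombie) entries in the ps listing by parent PID --
# the parent is the process that has not yet reaped its children.
ps -ef | awk '/<defunct>/ {count[$3]++} END {for (p in count) print p, count[p]}'
```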

We have the dsdlockd process running at the default interval of 900 seconds, but the defunct processes remain even after waiting for that period.

All these processes eventually disappear but sometimes we get so many that the machine grinds to a halt.

If we turn off the inter-process row buffering performance option ("Enable row buffer" / "Inter process") and just use the default, the problem is reduced. The trouble is that the job is then too slow!

Any ideas as to what is happening here? We have not tried specifying a log file path in the dsdlockd.config file. If we did, would it tell us whether dsdlockd finds and tries to remove defunct processes?

Pleased to receive any suggestions :)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

[Server jobs]

What program are the processes running? (DSD.RUN or DSD.StageRun)
Use ps -aef to determine what resources each is consuming; take more than one sample and calculate the deltas.
Try reducing the dsdlockd interval to, say, 450 seconds.
Are the jobs and stages reported in Monitor view as finished?
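The sampling-and-deltas step above could be sketched like this (the /tmp paths and the interval are arbitrary; on a real system you would use a longer gap, say 60 seconds):

```shell
# Take two ps snapshots a short interval apart and diff them; lines that
# appear or disappear show process churn, and changed TIME fields show
# which processes are actually consuming CPU.
ps -eo pid,ppid,time,comm | sort -n > /tmp/ps_sample1
sleep 2
ps -eo pid,ppid,time,comm | sort -n > /tmp/ps_sample2
diff /tmp/ps_sample1 /tmp/ps_sample2 || true   # diff exits 1 when the samples differ
```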

It would not hurt to capture a log file for dsdlockd, but I'm not sure that it reports removal of defunct processes; it's really intended for reporting discovery (and resolution, if configured) of deadlock situations.

You might also like to investigate periodically running a shell script that executes the commands:

. `cat /.dshome`/dsenv
`cat /.dshome`/bin/dslictool clean_lic -a
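For example, the pair of commands could be scheduled from the superuser's crontab (a sketch only; the 15-minute interval is arbitrary, and the commands are exactly those above):

```
# root crontab entry: every 15 minutes, source the DataStage environment
# and run the licence/process cleanup.
*/15 * * * * . `cat /.dshome`/dsenv; `cat /.dshome`/bin/dslictool clean_lic -a
```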




Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
Paul Preston
Participant
Posts: 24
Joined: Wed Apr 02, 2003 7:09 am
Location: United Kingdom

Post by Paul Preston »

Hello Ray

We see both DSD.RUN and DSD.StageRun, and lots of processes.
Yes, jobs and processes are reported as finished in Monitor.
We are now capturing the dsdlockd log file, but nothing is reported apart from the normal startup. We have changed the dsdlockd interval to 300 seconds, yet defunct processes still do not seem to be removed by it.

Taking deltas on CPU usage, we can't see any CPU being used by each individual defunct process, but the total CPU use shown by prstat -a indicates that collectively they are using something.

The problem is noticeably worse as machine load rises.

I can ask the unix administrator (as super user) to run the commands:
`cat /.dshome`/dsenv
`cat /.dshome`/bin/dslictool clean_lic -a

but he will want to know exactly what dslictool will do. I tried running dslictool and it reported that we have 8 CPUs licensed, but told me I needed to be superuser to run it with the clean_lic option.

Is there any harm in running a shell script to kill defunct processes (perhaps as a DataStage after-job shell script)?

Paul.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The difficulty is knowing exactly which processes to kill. The job runs as one process, and active stages (such as Transformer stages) run as child processes. Any use of DSExecute, ExecSH, ExecTCL and so on may also start child processes.
You can, of course, determine the job's pid in an after-job subroutine

* Declare the C library's getpid() via the General Calling Interface
DECLARE GCI getpid
...
JobPid = getpid()

Then identify its children and kill those (since, by the time an after-job subroutine is executing, there should not be any). However:
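The "identify its children" step could be sketched from the shell like this (an assumption on my part, not Ray's exact method; the job pid is passed in as $1):

```shell
# List the direct children of a given PID by matching the PPID column
# (third field of ps -ef output).
JOBPID=$1
ps -ef | awk -v parent="$JOBPID" '$3 == parent {print $2}'
```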

The dslictool command with clean_lic option is the preferred way to clean up defunct processes, since it frees resources such as locks and open files held by those processes, which kill does not do. It cleans up based on the shared memory segments associated with dead processes. (This is why it must run with superuser privilege.)
The -a option recomputes licence counts.



Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518