Job status "running" when all stages "finished"

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Bala wrote:
The re-run of that ETL job was successful. But in the next run, another job is still in "Running" status, although the monitor shows "Finished" for all the links. What could be the reason for this? Please advise if anyone knows.

Each active stage runs as a child of the job process, which waits for each child to finish and notify it. If any one of those signals is not received, the job will never mark its status as finished. Another possibility is that an after-job subroutine fails to return.
If you are 100% certain that all processes have finished, you can clear the job's status file (and possibly reset the job) to remedy this condition.
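The waiting behaviour described above can be illustrated with a toy model. This is only a sketch in Python, not how DataStage is implemented: threads stand in for the child processes, and `run_job`, its `stages` argument and the stage names are made up for the example. A timeout is added purely so the example terminates; a parent without one would wait forever.

```python
import queue
import threading


def run_job(stages, wait_seconds=0.5):
    """Toy model: start one child per stage and wait for a signal from each.

    `stages` maps a stage name to a bool saying whether that child manages
    to notify the parent. Returns (status, stages still awaited).
    """
    signals = queue.Queue()

    def child(name, notifies):
        # The stage's work is assumed done; only the notification can go missing.
        if notifies:
            signals.put(name)

    threads = [threading.Thread(target=child, args=(name, ok))
               for name, ok in stages.items()]
    for t in threads:
        t.start()

    pending = set(stages)
    while pending:
        try:
            pending.discard(signals.get(timeout=wait_seconds))
        except queue.Empty:
            # No signal arrived: the job cannot mark itself as finished.
            return "Running", pending
    return "Finished", pending
```

With one missing notification, the job stays "Running" even though every child has ended, which matches the symptom in the question.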
Bala
Participant
Posts: 17
Joined: Mon Oct 14, 2002 8:05 pm

Post by Bala »

I tried the following actions, but to no avail:
1. clear &PH& directory.
2. clear status file
3. clear resources.
4. clear logs.
5. cleanup project.
So I set up the main job to call the child jobs sequentially, instead of calling the independent jobs in parallel. This has improved the situation. Thanks.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

As your skill with DataStage BASIC improves (or if you attend the Programming with DataStage BASIC class) you will be able to create control jobs that are somewhat more bullet-proof than the default.
mihai
Participant
Posts: 30
Joined: Thu Mar 13, 2003 5:24 am
Location: Hertfordshire

Post by mihai »

Bala..


We are experiencing the same problem here. The problem has been flagged with Ascential since version 3.x, but no resolution is forthcoming just yet. We have been trying to get a reproducible environment for years, but without success (even though we experience one of these goodies regularly).


First of all, bear in mind that the status shown by the Director is not necessarily the status of the job. Clearing the RT_STATUS file is not necessarily reflected in the Director. You will find that once the resources have been freed and the status file cleared, the Director status may still indicate Running against the job. Nevertheless, you can (in most cases) recompile the job without problems (something that should not be possible if the job was actually running).

We call this symptom a 'job hang' and we have developed a suite of routines that monitor the job's stages (using DSGetStageInfo) and the number of rows they process. When the row volume has not changed for a period of time that is unusually long, we declare the job as 'hung'.
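The 'hung' test itself is simple enough to sketch. The following is a minimal Python illustration of the idea, not the actual routines (which are DataStage BASIC and are not shown in this thread). The class name `StallDetector`, the `max_idle_seconds` threshold and the way row counts are supplied are all assumptions; in practice the counts would come from something like DSGetStageInfo.

```python
class StallDetector:
    """Flags a job as hung when its row count stops changing for too long."""

    def __init__(self, max_idle_seconds):
        self.max_idle_seconds = max_idle_seconds
        self.last_count = None   # most recent row count seen
        self.last_change = None  # time at which the count last changed

    def observe(self, row_count, now):
        """Record one sample; return True if the job should be declared hung."""
        if self.last_count is None or row_count != self.last_count:
            # First sample, or progress was made: restart the idle clock.
            self.last_count = row_count
            self.last_change = now
            return False
        return (now - self.last_change) >= self.max_idle_seconds
```

A monitor would call `observe` at each polling interval with the current row total and a timestamp in seconds. The threshold has to be tuned per job, since what counts as "unusually long" depends on the normal throughput.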


Once the job is hung, we do a recovery process which boils down to the following:

*) run DS.PLADMIN.CMD NOPROMPT LIST PIDS <jobname> to identify the process IDs belonging to the job
*) run DS.PLADMIN.CMD NOPROMPT CLEAR PIDS <jobname> to get DataStage to log out its processes (this doesn't always work)
*) run the NT Resource Kit's kill command against each of the processes belonging to the job that are still milling around
*) LOGOUT each process in turn
*) Clear the status file
*) Re-run the job.

So far, this 'patch' is still experimental and it's currently being tested.
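The order of the recovery steps above can be written down as a small driver. This is a hedged Python sketch of the sequence only: the two DS.PLADMIN.CMD command strings are taken from the list above, but every callable (`run_command`, `parse_pids`, `still_alive`, `kill`, `logout`, `clear_status`, `rerun`) is a hypothetical hook that the caller must supply, because the thread does not show the output format of LIST PIDS or the exact kill, LOGOUT and status-clearing invocations.

```python
LIST_PIDS = "DS.PLADMIN.CMD NOPROMPT LIST PIDS {job}"
CLEAR_PIDS = "DS.PLADMIN.CMD NOPROMPT CLEAR PIDS {job}"


def recover_hung_job(job, run_command, parse_pids, still_alive,
                     kill, logout, clear_status, rerun):
    """Walk through the recovery steps in order; return a log of actions taken.

    All callables are hypothetical hooks supplied by the caller.
    """
    log = []
    pids = parse_pids(run_command(LIST_PIDS.format(job=job)))
    log.append(("list_pids", pids))

    run_command(CLEAR_PIDS.format(job=job))
    log.append(("clear_pids", job))

    # CLEAR PIDS does not always work, so kill whatever is still around.
    for pid in pids:
        if still_alive(pid):
            kill(pid)
            log.append(("kill", pid))

    for pid in pids:
        logout(pid)
        log.append(("logout", pid))

    clear_status(job)
    log.append(("clear_status", job))
    rerun(job)
    log.append(("rerun", job))
    return log
```

Keeping the actions as injected hooks means the sequence can be dry-run with stubs before it is pointed at a live server.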

We have also noticed Access Violation messages in the Application Logs on the DataStage server occurring at the same time as the hangs.



The factors that /may/ contribute to the hangs have been suggested as:
*) excessive fragmentation of the filesystem
*) multi-processor architecture leading to threads being mislaid by the OS
*) Memory management by the OS


I'm sorry to say there is no fireproof resolution to this, although we were told that upgrading will resolve it (hah!). We're running 4.2.1r8. What are you running?

If you have a reproducible environment (and an Ascential support contract), please let me know.

