Posted: Wed Aug 29, 2007 6:50 am
by chulett
I don't have an answer for you, but just wanted to compliment you on your post. If anyone needed a model for getting help on a problem, they have one now. Excellent. :wink:

Posted: Wed Aug 29, 2007 7:01 am
by katz
Thanks Craig - it's nice to know that my effort to be clear is appreciated. However, when reading your post I noticed that I've posted in the wrong forum.

katz

Posted: Wed Aug 29, 2007 7:13 am
by chulett
D'oh! :wink:

Posted: Wed Aug 29, 2007 4:44 pm
by ray.wurlod
You are correct (at least as far as I know); DSWaitForJob interrogates the RT_STATUSnnn table for the job. This table sometimes does not get updated - which is why killed jobs sometimes appear to retain a "Running" state forever.

However, if the job status shows as "Finished", then one of the active stage records in RT_STATUSnnn may not have been updated. These records are used for the Monitor. Does the Monitor show all stages finished when this problem occurs?

Posted: Thu Aug 30, 2007 5:57 am
by katz
Yes, the monitor shows that all the stages have completed.

There have been a couple of cases where the DSWaitForJob executed in an After Job Routine was the one that "randomly" failed to detect when the called job was finished. But the problem has equally occurred in jobs that do not use an after routine, so I don't feel that the issue is related to the routine.

I have recompiled all the jobs, but that has not made any difference.

As I mentioned, this problem did not occur before we recently implemented Pluggable Authentication (PAM), which entails executing a uvregen. Although the only change made in the uvconfig source is setting the parameter value AUTHENICATION 1, I cannot help but wonder if the new UV object has some issue.

Also, I have discovered that the dsepam entry was not created in the pam.conf file. However, I can't see any direct relationship between that entry and the symptoms I have (and all users are able to connect without the dsepam entry). Nevertheless, I've requested that the dsepam entry be made just so I can rule out that possibility.

Thanks,
katz

Posted: Thu Aug 30, 2007 4:11 pm
by ArndW
Katz,

I've often implemented a small loop instead of the non-interruptible DSWaitForJob() call. The loop calls DSGetJobInfo() to get the status; if the job is still running, it waits a couple of seconds and then tries again. That way I can issue a call to DSLogFatal() if I end up waiting too long. Although this will not stop the DSWaitForJob() hang situation, it will let you control how to fail the processes.
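
For anyone who wants to see the shape of that loop, here is a minimal, language-neutral sketch written in Python rather than DataStage BASIC. Everything in it is hypothetical: `get_status` stands in for a DSGetJobInfo() status lookup, the `RUNNING` value stands in for whatever "running" status the real API reports, and raising `JobWaitTimeout` stands in for the DSLogFatal() call. It only illustrates the poll, sleep, and give-up logic, not the real API.

```python
import time

RUNNING = "running"  # hypothetical stand-in for the job's "running" status value


class JobWaitTimeout(Exception):
    """Raised when the job is still running after the allowed wait."""


def wait_for_job(get_status, timeout_s=600.0, poll_s=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the job stops running or timeout_s elapses.

    get_status is a caller-supplied function (an assumption of this sketch);
    in a real routine it would wrap the job status lookup.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status != RUNNING:
            # Job is no longer running: hand the final status back.
            return status
        if clock() >= deadline:
            # Waited too long: fail in a controlled way instead of hanging.
            raise JobWaitTimeout(
                "job still running after %.0f seconds" % timeout_s
            )
        sleep(poll_s)  # wait a couple of seconds, then try again
```

The point of the design is that the caller decides how long is too long and what happens then, which a blocking wait does not allow.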

Posted: Sun Sep 02, 2007 12:05 am
by ray.wurlod
The highlighted message means that DSWaitForJob() returns immediately under either of the following two conditions:
  • the job on the job handle has finished
  • the job on the job handle has been started again after finishing on the same job handle (that is, without there having been a call to the DSDetachJob() function)

Posted: Tue Sep 04, 2007 6:16 am
by srinagesh
Check whether there are any network glitches / system activity at that time.

You can look for these messages in /var/adm/messages

Posted: Tue Sep 04, 2007 4:28 pm
by ray.wurlod
Different records in RT_STATUSnnn have different structures. Only the first five fields are common to all record types. There are records for the job, for each active stage, and for each "resource".

Posted: Sat Sep 22, 2007 1:28 pm
by katz
The underlying problem with the TZ environment variable was resolved by restarting the cron daemon. The work-around assignment made in the dsenv file can now be removed, and there have been no more incidents of jobs hanging on DSWaitForJob.