DSWaitForJob waiting indefinitely

chulett · Post by **chulett** » Wed Aug 29, 2007 6:50 am

I don't have an answer for you, but just wanted to compliment you on your post. If anyone needed a model for getting help on a problem, they have one now. Excellent.

katz · Post by **katz** » Wed Aug 29, 2007 7:01 am

Thanks Craig - it's nice to know that my effort to be clear is appreciated. However, when reading your post I noticed that I've posted in the wrong forum.

katz

chulett · Post by **chulett** » Wed Aug 29, 2007 7:13 am

D'oh!

ray.wurlod · Post by **ray.wurlod** » Wed Aug 29, 2007 4:44 pm

You are correct (at least as far as I know); DSWaitForJob interrogates the RT_STATUSnnn table for the job. This table sometimes does not get updated - which is why killed jobs sometimes appear to retain a "Running" state forever.

However, if the job status shows as "Finished", then one of the active stage records in RT_STATUSnnn may not have been updated. These records are used for the Monitor. Does the Monitor show all stages finished when this problem occurs?

katz · Post by **katz** » Thu Aug 30, 2007 5:57 am

Yes, the monitor shows that all the stages have completed.

There have been a couple of cases where the DSWaitForJob executed in an After Job Routine was the one that "randomly" failed to detect when the called job was finished. But the problem has equally occurred in jobs that do not use an after routine, so I don't feel that the issue is related to the routine.

I have recompiled all the jobs, but that has not made any difference.

As I mentioned, this problem did not occur before we recently implemented Pluggable Authentication (PAM), which entails executing a uvregen. Although the only change made in the uvconfig source is setting the parameter value AUTHENICATION 1, I cannot help but wonder if the new UV object has some issue.

Also, I have discovered that the dsepam entry was not created in the pam.conf file. I can't see any direct relationship between that entry and the symptoms I have (and all users are able to connect without the dsepam entry). Nevertheless, I've requested that the dsepam entry be made just so I can rule out that possibility.

Thanks,
katz

ArndW · Post by **ArndW** » Thu Aug 30, 2007 4:11 pm

Katz,

I've often implemented a small loop instead of the non-interruptible DSWaitForJob() call. It issues a call to DSGetJobInfo() to get the status, and if the job is still running it waits a couple of seconds and then tries again. That way I can issue a call to DSLogFatal() if I end up waiting too long. Although this will not stop the DSWaitForJob() hang situation, it will let you control how to fail the processes.
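
A minimal, language-neutral sketch of that polling loop, written in Python rather than DataStage BASIC so it can be run anywhere. Everything named here is an assumption for illustration: `get_status` stands in for a DSGetJobInfo() status lookup, the `RUNNING` value stands in for whatever "still running" status the real call returns, and raising `JobWaitTimeout` stands in for the DSLogFatal() call.

```python
import time

# Hypothetical stand-in for the "job is still running" status value.
RUNNING = "RUNNING"


class JobWaitTimeout(Exception):
    """Raised when the job is still running after the allowed wait."""


def wait_for_job(get_status, timeout_s=3600.0, poll_s=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it stops reporting RUNNING.

    get_status is a hypothetical callable standing in for the status
    lookup. Returns the first non-running status, or raises
    JobWaitTimeout once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status != RUNNING:
            return status
        if clock() >= deadline:
            # This is where the caller decides how to fail the process.
            raise JobWaitTimeout(
                "job still running after %.0f seconds" % timeout_s)
        sleep(poll_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. As noted above, the loop does not cure a status that never updates; it only bounds how long the caller waits.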

ray.wurlod · Post by **ray.wurlod** » Sun Sep 02, 2007 12:05 am

The highlighted message means that DSWaitForJob() returns immediately under either of the following two conditions:

the job on the job handle has finished

the job on the job handle has been started again after finishing on the same job handle (that is, without there having been a call to the DSDetachJob() function)

srinagesh · Post by **srinagesh** » Tue Sep 04, 2007 6:16 am

Check whether there are any network glitches / system activity at that time.

You can look for these messages in /var/adm/messages
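
One way to narrow that search to the time of a hang is a small filter like the sketch below. It assumes the file uses the traditional syslog line prefix (`Mon DD HH:MM:SS`), which carries no year, so the year is passed in; the function name and parameters are hypothetical, and month names are parsed according to the current locale.

```python
from datetime import datetime


def syslog_lines_between(lines, start, end, year):
    """Yield lines whose leading 'Mon DD HH:MM:SS' stamp is in [start, end].

    Assumes traditional syslog formatting; lines that do not start with
    a parseable timestamp are skipped.
    """
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(
                "%d %s %s %s" % (year, parts[0], parts[1], parts[2]),
                "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if start <= stamp <= end:
            yield line
```

For example, opening `/var/adm/messages` and passing the file object as `lines`, with `start` and `end` set a few minutes either side of the hang, would print only the entries logged around that time.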

ray.wurlod · Post by **ray.wurlod** » Tue Sep 04, 2007 4:28 pm

Different records in RT_STATUSnnn have different structures. Only the first five fields are common to all record types. There are records for the job, for each active stage, and for each "resource".

katz · Post by **katz** » Sat Sep 22, 2007 1:28 pm

The underlying problem with the TZ environment variable was resolved by restarting the cron daemon. The work-around assignment made in the dsenv file can now be removed, and there have been no more incidents of jobs hanging on DSWaitForJob.