DS jobs hang

Posted: Thu Apr 06, 2006 1:10 pm
by jane_jiang
In our daily production ETL load, DS jobs hang for no apparent reason, and it is a different job each time. When a job hangs, there is no error message in the job log and the job status still shows "Running". We have to stop the job manually from the DS Director and kill the phantom process on the server box. Most of the time, the job completes successfully on the re-run without any code changes.

We have tried restarting the DS server and cleaning up the project's &PH& directory. The "Do not timeout" option is also checked. But none of these seemed to work.

Has anyone experienced something similar? Can anyone share possible causes?

Posted: Thu Apr 06, 2006 1:33 pm
by kcbland
Reasons jobs "stall" at the end of the job:

1. Full job logs automatically purging
2. After-stage or after-job routines going out and doing something
3. After-stage or after-job routines in two jobs locking same object (database table, sequential file, etc)
4. Commit is issued, database doing its thing
5. &PH& project directory full of lots of files
6. DS server overloaded, takes a long time to catch up to jobs' true statuses
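
For reason 5, one quick way to see whether &PH& has grown large is to count its files and see how many are old. The following is only a minimal sketch: the function name `summarize_ph_dir`, the 7-day cutoff, and passing the directory on the command line are assumptions for illustration, not anything DataStage provides.

```python
import os
import sys
import time


def summarize_ph_dir(ph_dir, max_age_days=7, now=None):
    """Return (total_files, old_files) for the given directory.

    A file counts as "old" when its modification time is more than
    max_age_days before `now` (the 7-day default is an arbitrary assumption).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    total = 0
    old = 0
    with os.scandir(ph_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # ignore subdirectories and other non-files
            total += 1
            if entry.stat().st_mtime < cutoff:
                old += 1
    return total, old


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the full path of your project's &PH& directory as the argument.
    total, old = summarize_ph_dir(sys.argv[1])
    print(f"{total} files in total, {old} older than 7 days")
```

Quote the path when running it from a shell, since & is special there. The sketch only reports counts and does not delete anything.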

Posted: Thu Apr 06, 2006 4:38 pm
by ray.wurlod
If you kill the job processes on the server, they will never get the chance to update their status values (which are read by the Director client) and so will appear to remain in a "running" state.

Posted: Tue Apr 11, 2006 9:42 am
by jane_jiang
Thanks for your helpful information.

Kcbland, the problem we have is that the job hangs forever. The job is either trying to query the Oracle database or trying to do an insert. Regarding reason 6, do you know how I can find out whether the DS server is overloaded? Is there a way to monitor it and fix the problem?

Ray, the problem is that the job runs for too long and we have to stop it from the Director client, but the phantom is still there on the server side even after the job is stopped. Do you know if there is a better way to kill the phantom?

Do you think this hanging issue might be related to a timeout between the DS server box and the Oracle server? Is there a variable I can tune in the uvconfig file?

Again, thanks so much.

Posted: Tue Apr 11, 2006 10:05 am
by kcbland
prstat (Solaris 2.8+), iostat, vmstat, netstat are commonly used tools for performance monitoring. Consider downloading top (prstat is a version of it) or glance from HP.
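
As a rough complement to those tools, the system load average (the same figure that top and uptime report) can be compared with the number of CPUs. This is only a sketch under assumptions: the helper names are made up for illustration, and the threshold of 1.0 runnable process per CPU is a common rule of thumb, not a DataStage-specific limit.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalise a load average by the number of CPUs (at least 1)."""
    return load_avg / max(cpu_count, 1)


def is_overloaded(load_avg, cpu_count, threshold=1.0):
    """True when the load per CPU exceeds the (assumed) threshold."""
    return load_per_cpu(load_avg, cpu_count) > threshold


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    try:
        # 1-, 5- and 15-minute load averages; available on Unix only.
        averages = os.getloadavg()
    except (AttributeError, OSError):
        averages = ()
        print("load average is not available on this platform")
    for label, value in zip(("1 min", "5 min", "15 min"), averages):
        flag = "HIGH" if is_overloaded(value, cpus) else "ok"
        print(f"{label}: load {value:.2f} on {cpus} CPUs -> {flag}")
```

Running this from cron during the load window and keeping the output would show whether the hangs coincide with high load. It says nothing about I/O or memory pressure, for which iostat and vmstat are still needed.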

Posted: Tue Apr 11, 2006 4:54 pm
by ray.wurlod
You're more likely to find a timeout to tune in the Oracle client software. There's nothing in uvconfig that would help.

Posted: Wed Apr 12, 2006 1:52 am
by raj_konig
Hi Jane,

In my experience, the query you are using in the job probably needs some tuning. Try running the same query directly in the database instead of through DataStage. This may fix your problem.

If not, then the query or job must be in a waiting state because the table you are trying to operate on is locked.

If you face a similar issue again, it is better not to forcibly stop the job. Check the target table's status, and check whether the job or any table it is accessing is locked.
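
One possible way to check for a lock on the Oracle side is to query the V$LOCKED_OBJECT view joined to DBA_OBJECTS. The sketch below assumes you already have a DB-API cursor from an Oracle driver for Python (for example python-oracledb) and an account that can read both views. The function name and the table name in the usage note are hypothetical.

```python
# Lists sessions holding locks on a given table.
# Reading V$LOCKED_OBJECT and DBA_OBJECTS needs suitable privileges.
LOCKED_OBJECTS_SQL = """
SELECT o.owner,
       o.object_name,
       l.session_id,
       l.oracle_username,
       l.locked_mode
  FROM v$locked_object l
  JOIN dba_objects o ON o.object_id = l.object_id
 WHERE o.object_name = :table_name
"""


def find_table_locks(cursor, table_name):
    """Return one row per session currently holding a lock on table_name.

    `cursor` is any DB-API cursor that accepts named bind parameters
    passed as a dict, as Oracle drivers for Python do.
    """
    # Unquoted Oracle object names are stored in upper case.
    cursor.execute(LOCKED_OBJECTS_SQL, {"table_name": table_name.upper()})
    return cursor.fetchall()
```

With an open connection, `find_table_locks(connection.cursor(), "my_target_table")` returns an empty list when nothing holds a lock on that table. Any rows that come back show the session ID and Oracle user to look into before stopping the job.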

This may fix your issue.

Thanks,
raj