DSRunJob returning fatal errors as non-fatal?

Richard615
Participant
Posts: 12
Joined: Tue Mar 27, 2007 11:08 am

Post by Richard615 »

Folks;

Here's the skinny. We have a routine that runs every morning, which calls a parallel job. In that routine, here's the bit that actually is kicking off the job:

errStatus = DSRunJob(hJob, DSJ.RUNNORMAL)

errStatus = DSWaitForJob(hJob)

JobStatus = JobFailedCheck(DSGetJobInfo(hJob, DSJ.JOBSTATUS))

These three lines are exactly as-is - there is *NO* checking of those errStatus return codes. This is inherited code that works fine 99.999% of the time. Once in a blue moon (like Tuesday night) the following happens from what I can tell from the log file:

1) The DSRunJob is run
2) EXACTLY one minute later, the DSWaitForJob is called, which returns instantly.
3) The JobStatus is checked, finds a good job status, and the routine continues processing.

BUT - the job is never actually run. The DSWaitForJob finished because of course the job is in a finished status from the day before, and the JobStatus that is read is also that from the previous day's run.

Notice that the DSRunJob step took exactly one minute, the same length as the normal time-out limit. Usually when that happens a fatal error is thrown, but in this case no such error occurred. We had fifteen jobs kick off at once, and six of them failed in this fashion, all six with the exact same times in their log files. So it looks like a normal case of DSRunJob timing out, except for the lack of a fatal error as mentioned.

From the number of jobs running and the times involved, it appears that this was a time-out error that wasn't flagged for some reason. But even if it was some other error (I can't tell because the return codes were not logged) I would think any error that causes the job not to be run should be a fatal error.
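
For reference, the missing return-code checks could look something like the sketch below. This is only an illustration under assumptions, not the inherited code: "MyRoutine" is a placeholder for the calling routine's name, and it assumes the routine runs somewhere that DSLogFatal is allowed to abort (swap in DSLogWarn if you only want the code recorded). DSRunJob is compared against DSJE.NOERROR, and DSWaitForJob is treated as failed only when it returns a negative value.

```
* Sketch only: the same three calls, with each return code checked and logged.
* "MyRoutine" is a placeholder for the name of the calling routine.
errStatus = DSRunJob(hJob, DSJ.RUNNORMAL)
If errStatus <> DSJE.NOERROR Then
   * The run request was rejected, so the job never started.
   Call DSLogFatal("DSRunJob failed, return code ":errStatus, "MyRoutine")
End

errStatus = DSWaitForJob(hJob)
If errStatus < 0 Then
   * Negative values are DSJE error codes (bad handle, time-out and so on).
   Call DSLogFatal("DSWaitForJob failed, return code ":errStatus, "MyRoutine")
End

JobStatus = JobFailedCheck(DSGetJobInfo(hJob, DSJ.JOBSTATUS))
```

With something like this in place, a run request that never goes through would stop the routine instead of letting it read the previous day's status.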

Has anyone else ever seen anything like this before?

Seems strange to me...

Thanks - Richard
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Particularly in a multi-machine configuration, or for a complex job, DSWaitForJob() may kick in too early. DSRunJob() is an asynchronous call, so returns immediately once the run request has been submitted.
The job status does not change to starting until the score has been composed. The workaround is to put a short sleep between DSRunJob() and DSWaitForJob(). Experiment with the duration - somewhere between two and five seconds should suffice.
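
A minimal sketch of that workaround, assuming the same hJob handle as in the original routine. The five-second value is just a starting point inside the suggested range, to be tuned by experiment:

```
errStatus = DSRunJob(hJob, DSJ.RUNNORMAL)

* Pause so the job has time to leave its previous finished state
* before we start waiting on it. Five seconds is an assumed value.
Sleep 5

errStatus = DSWaitForJob(hJob)
```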
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.