
Sequencer getting errors trying to run

Posted: Thu Oct 05, 2006 3:59 pm
by lgharis
My developers are having recurring problems with Sequencer jobs in DataStage 7.5.1A. Sequencers that invoke jobs are unable to get the correct status of those jobs: even though the jobs are in a Compiled or Finished state, the sequencer cannot attach them.

The errors normally clear after recompiling the job. Does anyone have any suggestions as to what might be causing them?

Example: sequencer sqPM_DwHoldMonthly gives the following messages when trying to invoke job jbPM_SyncMlyHoldTransDW:

sqPM_DwHoldMonthly.1.JobControl (DSPrepareJob): Error getting status for job jbPM_SyncMlyHoldTransDW.1
sqPM_DwHoldMonthly.1.JobControl (@jbPM_SyncMlyHoldTransDW): Controller problem: Error calling DSPrepareJob(jbPM_SyncMlyHoldTransDW.1)
(DSGetJobInfo) Failed to open RT_STATUS1439 file.
The sequencer then aborts with a fatal error:
sqPM_DwHoldMonthly.1.JobControl (fatal error from @Coordinator): Sequence job will abort due to previous unrecoverable errors

The same happens with sequencer sqPM_DwHoldDaily trying to invoke jbPM_SyncDlyAcctRtnsDW:

sqPM_DwHoldDaily.1.JobControl (DSPrepareJob): Error getting status for job jbPM_SyncDlyAcctRtnsDW.1
sqPM_DwHoldDaily.1.JobControl (@jbPM_SyncDlyAcctRtnsDW1): Controller problem: Error calling DSPrepareJob(jbPM_SyncDlyAcctRtnsDW.1)
(DSGetJobInfo) Failed to open RT_STATUS1429 file.

sqPM_DwHoldDaily.1.JobControl (fatal error from @Coordinator): Sequence job will abort due to previous unrecoverable errors

The jobs the sequencers are trying to invoke are in Compiled status.

Posted: Thu Oct 05, 2006 4:10 pm
by kcbland
If job control is having issues getting information from the log and status files, it's probably hitting the maximum number of dynamic hashed files that can be open at once. Can you confirm your T30FILE setting in the uvconfig file? If it's too low, like the default of 200, you can't properly execute jobs. The entire repository is based on dynamic hashed files; even if you're only running PX jobs you are still using the internal repository and therefore fall under this setting. Consider upping the value to 1000 or 2000, but not too high. This will probably fix your problem.
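For reference, the change is made in the uvconfig file in the engine (DSEngine) directory and needs a regen and an engine restart to take effect. A rough sketch for a Unix install follows; the install path is a placeholder for your own, and on a Windows install you would stop the DataStage services and run uvregen.exe from the engine directory instead.

Code: Select all

# Stop the engine before touching uvconfig (placeholder install path)
cd /opt/Ascential/DataStage/DSEngine
. ./dsenv
bin/uv -admin -stop

# Edit the T30FILE line in ./uvconfig, e.g. raise it to:
#   T30FILE 1000

# Rebuild the binary configuration and restart the engine
bin/uv -admin -regen
bin/uv -admin -start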

The other thing to consider is looking at the server node and measuring CPU and disk usage, to see whether it is struggling to manage the repository and to start/stop jobs.
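On a Unix server node the standard tools are enough for a first look (on Windows, Performance Monitor serves the same purpose); for example:

Code: Select all

# Sample CPU/run-queue and per-disk activity every 5 seconds for a minute
# (exact flags vary by platform)
vmstat 5 12
iostat -x 5 12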

Posted: Fri Oct 06, 2006 2:41 am
by ray.wurlod
Did the file system on which your project exists ever become full? If so, some of the repository tables (which are hashed files) may have become corrupted.

Search the (server) forum for ways to check the integrity of hashed files in the project. To check a single file you can run a query against it that must touch every page, for example:

Code: Select all

SELECT COUNT(*) FROM RT_STATUS1429;
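For anyone unfamiliar with where to run that: use the Command button in the DataStage Administrator client, or the engine shell from the project directory. A sketch, assuming a Unix install and a made-up project path:

Code: Select all

# Attach to the project and run the query from the engine shell
cd /opt/Ascential/DataStage/Projects/MyProject   # placeholder project path
. $DSHOME/dsenv
$DSHOME/bin/dssh "SELECT COUNT(*) FROM RT_STATUS1429;"

If the count comes back cleanly the file can be read end to end; if the query hangs or reports a read error, the hashed file is suspect.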

Posted: Fri Oct 06, 2006 7:34 am
by lgharis
kcbland,
Thanks, yes, the T30FILE setting was still at the default of 200. I do see that we changed it to 800 on another server, but this one is newer and had not been changed. We will update the uvconfig.

Is it also possible that corruption of the hashed files occurs because the developers attempt to update a job design after the job completes, but while the sequencer job that executed it is still active? Or if they reset the status of the job before the sequencer completes?


ray,
Thanks for that suggestion, but I do not believe the file system filled up.

Posted: Fri Oct 06, 2006 2:34 pm
by ray.wurlod
It's perfectly OK (if not recommended practice) to work on a job design while the job is running. The running job uses the compiled version of the job (the generated OSH and any C++ components), not the design components, and you will find that DataStage prevents you from compiling a job that is actually running.

Resetting the status similarly performs an UPDATE on the RT_STATUSnnn hashed file; it is almost impossible that this would corrupt the hashed file (apart from hardware errors and so on).