Rerunning DataStage

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

bdixon
Participant
Posts: 35
Joined: Thu Nov 20, 2003 5:45 pm
Location: Australia, Sydney

Rerunning DataStage

Post by bdixon »

Hi,

We are looking at redesigning our DataStage batch runs and I am after some tips on making the batch easily rerunnable, because from time to time our batch will fall over due to an implementation error or a corrupt record which we have not catered for. Does anyone have any suggestions on how we can do this?

Thanks
Brad
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There are a number of techniques for making job control code bullet proof. They include:
  • ensuring that the job is attached properly
  • ensuring that a job is in a runnable state before requesting its run
  • ensuring that parameter values are read accurately from wherever
  • ensuring that parameter values are set without error in the controlled job
  • retaining control rather than performing a "sleep wait"
  • determining the exit status accurately
  • never calling DSLogFatal or executing STOP or ABORT statements
  • detaching the job when you're done (to free resources)
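
By way of illustration, here is a minimal DS BASIC sketch that walks that list. It is a sketch only, not production code, and the routine name RunOneJob and its three arguments are invented for the example:

    * Body of a hypothetical routine RunOneJob(JobName, ParamName, ParamValue).
    * Returns the controlled job's finishing status, or -1 if anything went wrong.
          RoutineName = "RunOneJob"
          Ans = -1
    * Attach without aborting on error (DSJ.ERRNONE) and test the handle.
          hJob = DSAttachJob(JobName, DSJ.ERRNONE)
          If NOT(hJob) Then
             Call DSLogWarn("Cannot attach to job " : JobName, RoutineName)
          End Else
    * Ensure the job is in a runnable state; DSPrepareJob resets it if necessary.
             hJob = DSPrepareJob(hJob)
             If NOT(hJob) Then
                Call DSLogWarn(JobName : " could not be made runnable.", RoutineName)
             End Else
    * Set the parameter value, checking that the call succeeded.
                ErrCode = DSSetParam(hJob, ParamName, ParamValue)
                If ErrCode <> DSJE.NOERROR Then
                   Call DSLogWarn("Failed to set " : ParamName : " in " : JobName, RoutineName)
                End Else
    * Request the run, wait, then determine the exit status accurately.
                   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
                   If ErrCode = DSJE.NOERROR Then
                      ErrCode = DSWaitForJob(hJob)
                      Ans = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
                   End
                End
    * Detach when done, to free resources. Note: no DSLogFatal anywhere.
                ErrCode = DSDetachJob(hJob)
             End
          End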
There are consultants out there with the skills to create one of these for your particular requirements. Hiring one will cost you less than trying to figure it out for yourself.
Constructing a job sequence that caters for all the above is not possible even unto release 7.x. It's getting better over time, but it's not all there yet.

And then there's the question of restartability of an entire sequence; do you need to build in the capacity to pick up where it left off? That's a more complex solution, and one of the reasons (among several) that we seem to go on and on about staging as part of your design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

ray.wurlod wrote:There are a number of techniques for making job control code bullet proof. They include:

...

never calling DSLogFatal or executing STOP or ABORT statements
Why not? What is the proper way to abort?

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Teej wrote:
ray.wurlod wrote:There are a number of techniques for making job control code bullet proof. They include:

...

never calling DSLogFatal or executing STOP or ABORT statements
Why not? What is the proper way to abort?

-T.J.
You never abort the top-level job. If the top-level job is the thing that can never fail, then never use APIs or BASIC statements that leave it in an unrunnable state. If the top-level job has the ability to prepare the jobstream, then it must always be ready to run, especially if an enterprise scheduler is remotely commandeering the jobstream and will pass in the runtime parameter values that set the boundaries for the jobstream execution. You can't execute that job using dsjob if it's in a non-runnable state. You could bullet-proof the ksh script that is being used to run the master job (first check if the master control job is runnable, if not, reset it, then run it). It's your choice. However, if you blow up the master control job, make sure you do it after all tasks are done, wrapped up, logged, etc. Pulling the ripcord in the middle, leaving things in an awful mess, is usually a bad thing.
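
For illustration, a minimal sketch of that check-reset-run logic, expressed in DS BASIC (in ksh you would wrap the same idea around dsjob -jobinfo and dsjob -run -mode RESET; the job name MasterControl is made up for the example):

    * Make sure the master job is runnable before requesting a run.
          hJob = DSAttachJob("MasterControl", DSJ.ERRNONE)
          If NOT(hJob) Then
             Call DSLogWarn("Cannot attach to MasterControl", "Sketch")
          End Else
             Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    * An aborted, crashed or stopped job must be reset before it can run again.
             If Status = DSJ.RUNFAILED Or Status = DSJ.CRASHED Or Status = DSJ.STOPPED Then
                ErrCode = DSRunJob(hJob, DSJ.RUNRESET)
                ErrCode = DSWaitForJob(hJob)
             End
             ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
          End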
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Teej wrote: Why not? What is the proper way to abort? -T.J.
You don't. You provide a path through your job control code whereby jobs don't get started. Usually with IF. :lol:
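
For example, a minimal sketch, assuming hJob1 and hJob2 are handles attached as described earlier in this thread:

    * Gate the next job on the previous job's status instead of aborting.
          Status1 = DSGetJobInfo(hJob1, DSJ.JOBSTATUS)
          If Status1 = DSJ.RUNOK Or Status1 = DSJ.RUNWARN Then
             ErrCode = DSRunJob(hJob2, DSJ.RUNNORMAL)
          End Else
    * Don't abort: log it, skip the dependent job, and finish cleanly.
             Call DSLogWarn("Job1 failed, so Job2 was not started.", "Sketch")
          End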
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

kcbland wrote: You could bullet-proof the ksh script that is being used to run the master job (first check if the master control job is runnable, if not, reset it, then run it).
That is basically what we do. Well, we use the Job Sequencer instead of a job control script (and I did experiment with the one you provided).
kcbland wrote: However, if you blow up the master control job, make sure you do it after all tasks are done, wrapped up, logged, etc. Pulling the ripcord in the middle, leaving things in an awful mess, is usually a bad thing.
We also do some of that in our ksh script. The one thing we do not like about using AbortToLog() within the job control code is its inability to tell ALL related jobs to stop what they're doing and then abort. So if multiple jobs are running at the same time, whoops.

Actually, I haven't gotten around to finding or building a better AbortToLog()... Hmm. I will have to add it to my to-do list here. Of course, that kind of feature is marketed (along with other Job Sequencer enhancements) for Trinity (7.1).
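
Roughly what I have in mind, as a hedged sketch: assume the control code kept the handles of the jobs it started in a dynamic array (RunningHandles is made up for the example; DSStopJob issues the same stop request as Director's Stop):

    * Ask every job we started that is still running to stop.
          JobCount = DCOUNT(RunningHandles, @FM)
          For I = 1 To JobCount
             hJob = RunningHandles<I>
             If DSGetJobInfo(hJob, DSJ.JOBSTATUS) = DSJ.RUNNING Then
                ErrCode = DSStopJob(hJob)
             End
          Next I
    * Then wait for each one, log, and detach, before signalling failure
    * once at the very end, as kcbland advises.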

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

One mutters: "Yeah, always in the next release. And I'm sure it will be exactly what I want. :wink: "
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

So what is the problem with sequence jobs? I've managed to get most of the functionality I need using a combination of sequence jobs and job control routines. I find that calling routines from sequence jobs helps promote good programming practices, makes it easier to see what is going on, and makes it easier to develop parallel processing paths.

Going over Ray's list of bullet-proof job control techniques:

"ensuring that the job is attached properly"
"ensuring that a job is in a runnable state before requesting its run"
- Doesn't the job stage automatically do this? It resets aborted jobs. I'm assuming here that you don't get uncompiled jobs in a production environment where jobs are read-only.

"ensuring that parameter values are read accurately from wherever"
"ensuring that parameter values are set without error in the controlled job"
- I use a routine that reads the parameters from a table or file and prepares and runs a job. All jobs are run from a single generic run-job routine and appear in the sequence job as a routine stage; a sketch of the idea follows this post.

"retaining control rather than performing a 'sleep wait'"
- Could someone expand on this? Is this the DSWaitForJob that is built into the run job stage?

"determining the exit status accurately"
- I use a routine, executed each time a job finishes, that checks the status of that job and performs error notification.

"never calling DSLogFatal or executing STOP or ABORT statements"
"detaching the job when you're done (to free resources)"
- Agree; sequence jobs should use triggers to end the processing in the controlling job with a notification message.
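
As promised above, a rough sketch of the generic run-job idea: read name=value lines from a flat file and set each one on an attached job. ParamFilePath and the file layout are assumptions for the example, and hJob is a handle obtained as in the earlier sketches:

    * Read "name=value" lines from a parameter file and set each parameter.
          OpenSeq ParamFilePath To ParamFile Then
             Loop
                ReadSeq ParamLine From ParamFile Else Exit
                ParamName = Field(ParamLine, "=", 1)
                ParamValue = Field(ParamLine, "=", 2)
                ErrCode = DSSetParam(hJob, ParamName, ParamValue)
                If ErrCode <> DSJE.NOERROR Then
                   Call DSLogWarn("Cannot set " : ParamName, "RunJobGeneric")
                End
             Repeat
             CloseSeq ParamFile
          End Else
             Call DSLogWarn("Cannot open " : ParamFilePath, "RunJobGeneric")
          End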
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

vmcburney wrote:So what is the problem with sequence jobs?
I think the graphical metaphor should look like a Microsoft Project plan, with milestones to signify major points in the process, for direct branching. It should have rollup tasks, so that you can expand and collapse the details when necessary. It should handle dynamic instantiation, to expand a single job into a set of divide-and-conquer clones. It should not be layers and layers of nested sequences, as that defeats the graphical metaphor. I also think its error handling is terrible in its current incarnation.
vmcburney wrote:
retaining control rather than performing a "sleep wait"
- could someone expand on this. Is this the DSWaitForJob that is built into the run job stage?
DSWaitForJob is a misnomer. It should be DSWaitForever, as in it will wait forever for a job to finish. There are no mechanisms to track whether the job is running too long, whether its performance has reached a point where a notification should take place, whether its link statistics are outside tolerances, etc. It blindly halts processing while it waits for the job to finish. It's not fire-and-forget, or fire-and-check-back-later; it's fixated on that job.
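
What retaining control might look like, as a rough sketch; MaxSecs and PollSecs are made-up tuning values, and the loop body is where tolerance checks and notifications would go:

    * Poll for completion instead of blocking forever in DSWaitForJob.
          MaxSecs = 3600
          PollSecs = 30
          Elapsed = 0
          Loop
             Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
          Until Status <> DSJ.RUNNING Or Elapsed >= MaxSecs Do
    * Control returns here every PollSecs seconds: room for checks and alerts.
             Sleep PollSecs
             Elapsed = Elapsed + PollSecs
          Repeat
          If Status = DSJ.RUNNING Then
             Call DSLogWarn("Job still running after " : MaxSecs : " seconds.", "Sketch")
          End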
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

vmcburney wrote:
never calling DSLogFatal or executing STOP or ABORT statements
detaching the job when you're done (to free resources)
- Agree, sequence jobs should use triggers to end the processing in the controlling job with a notification message.
This caught my eye: if a job fails, you are suggesting that triggers can be used to have the other jobs (within the Job Sequence, running in parallel with this job) abort, correct?

How?

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

For those of you who don't also hang out at ADN, here is a link to an interesting post on a feature (related to this thread) that will purportedly be in the 7.1 release.

And T.J. - no, that's not what he meant. :) At least, I don't think he meant jobs running in parallel, only subsequent Sequence jobs downstream of the current Sequence job.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I haven't tried to create an abort or stop that works across all running jobs; I just ensure that no future steps are started and that control returns back up to the controlling sequence job as soon as the running jobs are finished.

I once wrote an automatic retry in a job control routine: the routine would check the result of a finished job and, if it had crashed or aborted, restart it with the same job parameters. The environment was an old DataStage and Sun Solaris combination where a random job would abort about 10% of the time. The retry, and the notifications it sent, were inside the routine, so the calling sequence job continued happily if the retry worked. The number of retries and the wait time between retries were both job parameters.
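
A minimal sketch of the idea, reusing the hypothetical RunOneJob routine sketched earlier in the thread (MaxRetries and RetryWaitSecs would be the job parameters):

    * Retry a crashed or aborted job with the same parameter values.
          Deffun RunOneJob(JobName, ParamName, ParamValue) Calling "DSU.RunOneJob"
          Attempt = 0
          Loop
             Attempt = Attempt + 1
             JobStatus = RunOneJob(JobName, ParamName, ParamValue)
          Until JobStatus = DSJ.RUNOK Or JobStatus = DSJ.RUNWARN Or Attempt >= MaxRetries Do
    * Notify, pause, then go around again; RunOneJob's DSPrepareJob call
    * resets the aborted job before the next attempt.
             Call DSLogWarn(JobName : " failed on attempt " : Attempt : "; retrying.", "Retry")
             Sleep RetryWaitSecs
          Repeat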

This type of thing corrects the effects but not the cause of the aborts; however, the environment was so dodgy that the results justified it. I wouldn't do it in a new implementation.