Replication of scenario, Job sequence with status as CRASHED

harikhk · Post by **harikhk** » Wed Nov 02, 2016 10:37 am

Hello All,

I have a requirement to validate the status of a job/sequence before running it from the command line using dsjob:
$HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME

The script should check whether the status of the job is STOPPED, FAILED, or CRASHED
and reset the job if it is in any of those three states.
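
A minimal sketch of that check, assuming `dsjob -jobinfo` prints a line of the form `Job Status : CRASHED (96)`. The exact wording and the numeric code can differ between releases, so verify against your own output. The function names `job_status` and `needs_reset` are hypothetical.

```shell
#!/bin/sh
# Read `dsjob -jobinfo` output on stdin and print the text after "Job Status :".
# Assumption: the status is reported on a line starting with "Job Status".
job_status() {
    sed -n 's/^Job Status[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Return 0 (true) when the status text names one of the three states
# that should trigger a reset. Substring matching is deliberate, because
# the exact status wording is an assumption here.
needs_reset() {
    case "$1" in
        *FAILED*|*STOPPED*|*CRASHED*) return 0 ;;
        *) return 1 ;;
    esac
}
```

It could be used as `status=$($HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME | job_status)` followed by `if needs_reset "$status"; then ...; fi`.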

I know how to test for the STOPPED and FAILED statuses.

Q1. How can I reproduce a CRASHED status in order to test my script?

$HOME_DS/bin/dsjob -run -mode RESTART -wait -jobstatus PROJECT SEQUENCE_NAME

Looking forward to your suggestions.

Thank you

Mike · Post by **Mike** » Wed Nov 02, 2016 5:56 pm

Start a sufficiently long running job sequence and have your admin stop the DataStage engine. After the DataStage engine is brought back up again you should have a CRASHED job sequence.

Jobs end up CRASHED when there is an unexpected stoppage of the engine.

Mike

qt_ky · Post by **qt_ky** » Thu Nov 03, 2016 7:31 am

It may have CRASHED status, but I have also seen jobs incorrectly stuck in a RUNNING status post engine restart.

chulett · Post by **chulett** » Thu Nov 03, 2016 7:53 am

Ditto.

In your shoes I would test the 'normal' status results and then, knowing my mechanism was sound, would sleep soundly at night knowing it would catch the outliers as well.

harikhk · Post by **harikhk** » Fri Nov 04, 2016 10:24 am

Stopping the engine is difficult because other projects are running on this server.

$HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME

The script logic checks whether the status of the job is FAILED/STOPPED/CRASHED and whether the job is restartable (a checkpointed sequence).

In my case the sequence is restartable. To simulate the CRASHED scenario, I hard-coded the result of the job info call as CRASHED.
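
One way to hard-code the status without editing the script logic is to route the call through a variable and point it at a stub during testing. This is only a sketch: `DSJOB`, `FAKE_STATUS`, `fake_dsjob`, and `current_status` are hypothetical names, and the `Job Status : CRASHED (96)` line format is an assumption about what `dsjob -jobinfo` prints.

```shell
#!/bin/sh
# Command used to query the engine. Override it with a stub to simulate a status.
DSJOB=${DSJOB:-"${HOME_DS:-}/bin/dsjob"}

# Stub that imitates `dsjob -jobinfo` by printing a canned status line.
# FAKE_STATUS selects the status to simulate (defaults to CRASHED).
fake_dsjob() {
    printf 'Job Status\t: %s\n' "${FAKE_STATUS:-CRASHED (96)}"
}

# Print the status text for a job. $1 = project, $2 = job or sequence name.
current_status() {
    "$DSJOB" -jobinfo "$1" "$2" |
        sed -n 's/^Job Status[[:space:]]*:[[:space:]]*//p'
}
```

Setting `DSJOB=fake_dsjob` before calling `current_status PROJECT SEQUENCE_NAME` exercises the CRASHED branch while the real engine stays untouched. The stub only tests the script's decision logic; it does not leave the sequence itself in a crashed state.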

When the status is CRASHED, the script executes the command below:
$HOME_DS/bin/dsjob -run -mode RESTART -wait -jobstatus PROJECT SEQUENCE_NAME

This allows the sequence to be restarted from the point of failure, but the catch is that after the sequence completes (from the point of failure), it gets re-executed automatically.

Now I am left wondering why the sequence is getting re-executed.
Is it because of the word RESTART?
If I change it to RESET instead of RESTART, the complete sequence would execute instead of resuming from the point of failure.

Any suggestions to make this work without issues?

Thanks

qt_ky · Post by **qt_ky** » Fri Nov 04, 2016 11:25 am

It shouldn't restart, complete, and then run it all over again.... unless perhaps (just a guess) if there were multiple scripts or restart commands issued at the same time with the -wait option. I would want to try to reproduce that behavior, because it sounds like an unusual problem.

Mike · Post by **Mike** » Fri Nov 04, 2016 11:26 am

Personally, I would never attempt to automate the restart of a STOPPED or CRASHED job. These are unexpected ways for a job to end.

I would require human intervention to perform analysis and take corrective action based on the analysis.

Mike

qt_ky · Post by **qt_ky** » Fri Nov 04, 2016 11:27 am

I have to agree with Mike!

harikhk · Post by **harikhk** » Fri Nov 04, 2016 12:01 pm

qt_ky wrote: It shouldn't restart, complete, and then run it all over again.... unless perhaps (just a guess) if there were multiple scripts or restart commands issued at the same time with the -wait option. I would want to try to reproduce that behavior, because it sounds like an unusual problem.

It is a single script with a single statement that gets executed. The behaviour is really unusual.

harikhk · Post by **harikhk** » Fri Nov 04, 2016 12:03 pm

Mike wrote: Personally, I would never attempt to automate the restart of a STOPPED or CRASHED job. These are unexpected ways for a job to end.

I would require human intervention to perform analysis and take corrective action based on the analysis.

I had to automate this as per the project requirement. Would you suggest any other option, or is manual intervention preferred?

qt_ky · Post by **qt_ky** » Fri Nov 04, 2016 12:07 pm

How about automating detection of crashed or stopped and emailing an alert to say that intervention is required, for the reasons Mike mentioned?
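
A rough sketch of that detect-and-alert approach, assuming a standard `mailx` is available on the server and that the status text has already been extracted from `dsjob -jobinfo`. `MAILER`, `ALERT_TO`, and `alert_if_abnormal` are hypothetical names, and the address is a placeholder.

```shell
#!/bin/sh
# Command that sends the alert (mailx assumed) and a placeholder recipient.
MAILER=${MAILER:-mailx}
ALERT_TO=${ALERT_TO:-ops@example.com}

# $1 = project, $2 = job, $3 = status text reported for the job.
# Sends an alert and returns 1 for CRASHED or STOPPED; returns 0 otherwise.
alert_if_abnormal() {
    case "$3" in
        *CRASHED*|*STOPPED*)
            printf 'Job %s in project %s ended with status %s. Manual analysis is required before any reset or restart.\n' \
                "$2" "$1" "$3" |
                "$MAILER" -s "DataStage intervention required: $1/$2" "$ALERT_TO"
            return 1 ;;
    esac
    return 0
}
```

The calling script can stop on a non-zero return instead of issuing the reset or restart itself, which leaves the recovery decision to a person.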

Mike · Post by **Mike** » Fri Nov 04, 2016 12:19 pm

My opinion is that your project requirement is dangerous.

Jobs are not expected to abort/stop/crash.

I want to know what unexpected event triggered the abnormal termination.

Was it preventable? If it is due to a development or design defect, then get that fixed so that it doesn't repeat.

Is there a resource constraint that needs to be addressed?

Has the root cause of the abnormal termination been eliminated?

Is the appropriate recovery action a reset/rerun or is it a restart?

As qt_ky suggests, automated notification is the way to go so that analysis and corrective action can take place ASAP.

Mike

harikhk · Post by **harikhk** » Fri Nov 04, 2016 2:06 pm

Hi Mike,

The reason for the CRASHED status was an unplanned restart of the server, not a resource or design issue.

As suggested, I will implement manual intervention via email alerts or monitoring.

The question I still have is why the job restarts again automatically after completion.

Mike · Post by **Mike** » Fri Nov 04, 2016 2:41 pm

My questions weren't specific to your particular issue... rather they are questions that I ask any time that a job terminates abnormally.

Regarding your specific issue with the restart causing more than one run, I would carefully read through each job run's log to trace what was executed, what was skipped on restart, and what checkpoints were created. Perhaps you have activities that do not create a checkpoint. The "Summary of sequence run" log entry is particularly useful.

I wouldn't necessarily trust anything about a crashed job. The engine stopped abruptly, so all kinds of bad things could be possible.

Mike

PaulVL · Post by **PaulVL** » Fri Nov 04, 2016 3:21 pm

You should create a generic script that accepts project, job and invocation id of a desired target. The script should execute a reset of the job.

That way you can run it through your regular external job scheduler, which most often has an ON DEMAND ability. The reset would execute as your Production Batch ID, thus having full reset capabilities within the target project.

So, it would be the best of both worlds. It would allow your operations folks to have a reset ability on any job, and also cover you in terms of not always doing it automatically in your scripts. As Mike said, aborts need research. But as an admin, I'd rather not get paged out in the middle of the night just to hit a reset button if the application team understands it and just wants a reset done.
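
A generic reset script along those lines might look like the sketch below. It assumes `dsjob -run -mode RESET` and the `JOB.INVOCATION_ID` naming for multi-instance jobs; `reset_job`, `DSJOB`, and `DRY_RUN` are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Path to dsjob; override DSJOB for testing.
DSJOB=${DSJOB:-"${HOME_DS:-}/bin/dsjob"}

# Usage: reset_job PROJECT JOB [INVOCATION_ID]
# Set DRY_RUN=1 to print the command instead of running it.
reset_job() {
    if [ "$#" -lt 2 ] || [ "$#" -gt 3 ]; then
        echo "usage: reset_job PROJECT JOB [INVOCATION_ID]" >&2
        return 2
    fi
    project=$1
    target=$2
    # A multi-instance job is addressed as JOB.INVOCATION_ID.
    if [ -n "${3:-}" ]; then
        target="$2.$3"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$DSJOB -run -mode RESET -wait $project $target"
    else
        "$DSJOB" -run -mode RESET -wait "$project" "$target"
    fi
}
```

Running it first with `DRY_RUN=1` shows the exact command the scheduler would issue before anyone triggers a real reset.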