Replication of scenario, Job sequence with status as CRASHED
Hello All,
I have a requirement to validate the status of a job/sequence before running it from the command line using dsjob:
$HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME
The script should check whether the job status is STOPPED, FAILED, or CRASHED, and reset the job if it is in any of these three states.
I know how to test the STOPPED and FAILED statuses.
Q1. How can I replicate the CRASHED scenario in order to test my script?
$HOME_DS/bin/dsjob -run -mode RESTART -wait -jobstatus PROJECT SEQUENCE_NAME
Looking forward to your suggestions.
Thank you
Thanks,
HK
*Go GREEN..Save Earth*
Stopping the engine is difficult because other projects are running on this server.
$HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME
The script logic validates whether the job status is FAILED/STOPPED/CRASHED and whether the job is restartable (a checkpointed sequence).
In my case the sequence is restartable, and to replicate the CRASHED scenario I hard-coded the result of the job info call to CRASHED.
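For anyone wanting to test the same check without a real crash, here is a minimal sketch of the status logic with a test hook. It assumes that `dsjob -jobinfo` prints a line of the form `Job Status : CRASHED (96)` and that a failed run is reported as `RUN FAILED`; please verify both against your own installation. The function names and the `FAKE_JOBINFO` variable are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch only. Assumption: dsjob -jobinfo prints "Job Status : NAME (code)".

# Fetch the jobinfo text. FAKE_JOBINFO is a hypothetical test hook that
# bypasses dsjob so that a status such as CRASHED can be simulated.
get_jobinfo() {
    if [ -n "${FAKE_JOBINFO:-}" ]; then
        printf '%s\n' "$FAKE_JOBINFO"
    else
        "${HOME_DS:-}/bin/dsjob" -jobinfo "$1" "$2"
    fi
}

# Read jobinfo text on stdin and print the status name without its code.
job_status() {
    awk -F: '/^Job Status/ {
        s = $2
        sub(/\(.*/, "", s)                 # drop the "(96)" style suffix
        gsub(/^[ \t]+|[ \t]+$/, "", s)     # trim surrounding whitespace
        print s
        exit
    }'
}

# Return 0 when the status is one of the three the script should act on.
needs_recovery() {
    case "$1" in
        "RUN FAILED"|STOPPED|CRASHED) return 0 ;;
        *) return 1 ;;
    esac
}
```

With the hook set, for example `FAKE_JOBINFO='Job Status : CRASHED (96)'`, the pipeline `get_jobinfo PROJECT SEQUENCE_NAME | job_status` yields CRASHED without touching the engine, so the recovery branch can be exercised on a shared server.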
If the status is CRASHED, the script executes the command below:
$HOME_DS/bin/dsjob -run -mode RESTART -wait -jobstatus PROJECT SEQUENCE_NAME
This allows the sequence to be restarted from the point of failure, but the catch is that after the sequence completes (from the point of failure), it gets re-executed automatically.
Now I am left wondering why the sequence is getting re-executed.
Is it because of the RESTART mode?
If I change it to RESET instead of RESTART, the complete sequence would execute rather than resuming from the point of failure.
Any suggestions for making this work without issues?
Thanks
Thanks,
HK
It shouldn't restart, complete, and then run it all over again.... unless perhaps (just a guess) if there were multiple scripts or restart commands issued at the same time with the -wait option. I would want to try to reproduce that behavior, because it sounds like an unusual problem.
Choose a job you love, and you will never have to work a day in your life. - Confucius
qt_ky wrote: It shouldn't restart, complete, and then run it all over again.... unless perhaps (just a guess) if there were multiple scripts or restart commands issued at the same time with the -wait option. I would want to try to reproduce that behavior, because it sounds like an unusual problem.
It is a single script, and only a single statement gets executed. The behaviour is really unusual.
Thanks,
HK
Mike wrote: Personally, I would never attempt to automate the restart of a STOPPED or CRASHED job. These are unexpected ways for a job to end. I would require human intervention to perform analysis and take corrective action based on the analysis.
I had to automate this as per the project requirement. Would you suggest any other option, or is manual intervention preferred?
Thanks,
HK
My opinion is that your project requirement is dangerous.
Jobs are not expected to abort/stop/crash.
I want to know what unexpected event triggered the abnormal termination.
Was it preventable? If it is due to a development or design defect, then get that fixed so that it doesn't repeat.
Is there a resource constraint that needs to be addressed?
Has the root cause of the abnormal termination been eliminated?
Is the appropriate recovery action a reset/rerun or is it a restart?
As qt_ky suggests, automated notification is the way to go so that analysis and corrective action can take place ASAP.
Mike
Hi Mike,
The CRASHED status was caused by an unplanned restart of the server, not by a resource or design issue.
As suggested, I will implement manual intervention, prompted by email notification or by monitoring.
The question I still have is why the job restarts again automatically after completion.
Thanks,
HK
My questions weren't specific to your particular issue... rather they are questions that I ask any time that a job terminates abnormally.
Regarding your specific issue with the restart causing more than one run, I would carefully read through each job run's log to trace what was executed, what was skipped on restart, and what checkpoints were created. Perhaps you have activities that do not create a checkpoint. The "Summary of sequence run" log entry is particularly useful.
I wouldn't necessarily trust anything about a crashed job. The engine stopped abruptly, so all kinds of bad things could be possible.
Mike
You should create a generic script that accepts the project, job, and invocation id of a desired target. The script should execute a reset of the job.
That way you can run it from your regular external job scheduler, which most often has an ON DEMAND ability. The reset would execute as your Production Batch ID, thus having full reset capabilities within the target project.
So, it would be the best of both worlds. It would allow your operations folks to have a reset ability on any job, and also cover you in terms of not always doing it automatically in your scripts. As Mike said, aborts need research. But as an admin, I'd rather not get paged out in the middle of the night just to hit a reset button if the application team understands it and just wants a reset done.
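A rough sketch of such a generic reset script follows. It assumes the `-mode RESET` option of `dsjob -run` already used in this thread, and that an invocation id is passed to dsjob as `JOB.INVOCATION_ID`; treat both as things to confirm on your installation. The function names and the `DSJOB` override variable are hypothetical, added here only so the command can be swapped out for a dry run.

```shell
#!/bin/sh
# Sketch only. DSJOB is a hypothetical override so the script can be dry-run.
DSJOB="${DSJOB:-${HOME_DS:-}/bin/dsjob}"

# Build the job argument: JOB, or JOB.INVOCATION_ID when an id is supplied.
# Arguments: $1 project, $2 job, $3 optional invocation id.
reset_target() {
    if [ -n "${3:-}" ]; then
        printf '%s.%s\n' "$2" "$3"
    else
        printf '%s\n' "$2"
    fi
}

# Reset one job and wait for the reset to finish.
reset_job() {
    if [ "$#" -lt 2 ]; then
        echo "usage: reset_job PROJECT JOB [INVOCATION_ID]" >&2
        return 2
    fi
    "$DSJOB" -run -mode RESET -wait "$1" "$(reset_target "$@")"
}
```

The scheduler's on-demand entry would then call something like `reset_job PROJECT SEQUENCE_NAME` under the production batch id, which keeps the reset a deliberate action rather than something the nightly script does on its own.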