Replication of scenario, Job sequence with status as CRASHED

harikhk
Participant
Posts: 64
Joined: Tue Jun 04, 2013 11:36 am

Post by harikhk »

Hello All,

I have a requirement to check the status of a job/sequence before running it from the command line with dsjob:
$HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME

The script should check whether the job is in the STOPPED, FAILED, or CRASHED state
and reset the job if it is in any of those three states.

I know how to test for STOPPED and FAILED.

Q1. How can I replicate the CRASHED scenario in order to test my script?

$HOME_DS/bin/dsjob -run -mode RESTART -wait -jobstatus PROJECT SEQUENCE_NAME
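
A minimal sketch of the status check described above, offered as an illustration rather than a tested solution. It assumes that the -jobinfo output contains a line of the form "Job Status : CRASHED (96)" and that the numeric codes 3, 96 and 97 stand for FAILED, CRASHED and STOPPED; both assumptions should be verified against your own installation. The function names are hypothetical, and a canned sample is used instead of a live dsjob call.

```shell
#!/bin/sh
# Sketch only: the sample text below is an ASSUMED shape of `dsjob -jobinfo`
# output; check the "Job Status" line format on your own installation.

# extract_status: read -jobinfo output on stdin and print the numeric status
# code found in parentheses on the "Job Status" line.
extract_status() {
    sed -n 's/^Job Status[[:space:]]*:.*(\([0-9][0-9]*\)).*$/\1/p' | head -n 1
}

# needs_recovery: succeed (exit 0) for FAILED (3), CRASHED (96), STOPPED (97).
# These numeric codes are assumptions; confirm them before relying on them.
needs_recovery() {
    case "$1" in
        3|96|97) return 0 ;;
        *)       return 1 ;;
    esac
}

# Canned example used instead of a live call such as:
#   $HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME
sample_output='Job Status      : CRASHED (96)
Job Controller  : not available'

status_code=$(printf '%s\n' "$sample_output" | extract_status)
if needs_recovery "$status_code"; then
    echo "status $status_code: recovery needed"
else
    echo "status $status_code: nothing to do"
fi
```

Matching on the numeric code rather than the status text keeps the test independent of wording differences between versions.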


Looking forward to your suggestions.

Thank you
Thanks,
HK
*Go GREEN..Save Earth*
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Start a sufficiently long-running job sequence and have your admin stop the DataStage engine. After the DataStage engine is brought back up again, you should have a CRASHED job sequence.

Jobs end up CRASHED when there is an unexpected stoppage of the engine.

Mike
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

It may have CRASHED status, but I have also seen jobs incorrectly stuck in a RUNNING status after an engine restart.
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ditto.

In your shoes I would test the 'normal' status results and then, knowing my mechanism was sound, would sleep soundly at night knowing it would catch the outliers as well. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
harikhk
Participant
Posts: 64
Joined: Tue Jun 04, 2013 11:36 am

Post by harikhk »

Stopping the engine is difficult because other projects are running on this server.

$HOME_DS/bin/dsjob -jobinfo PROJECT SEQUENCE_NAME

The script checks whether the job status is FAILED/STOPPED/CRASHED and whether the job is restartable (a checkpointed sequence).

In my case the sequence is restartable, so to replicate the CRASHED scenario I hard-coded the result of the -jobinfo call to CRASHED.

When the status is CRASHED, the script executes the command below:
$HOME_DS/bin/dsjob -run -mode RESTART -wait -jobstatus PROJECT SEQUENCE_NAME
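
A rough sketch of how the hard-coded CRASHED result could be isolated behind a wrapper, so the script can be tested without touching the engine. FAKE_JOB_STATUS, run_dsjob and choose_action are hypothetical names introduced only for this illustration, and the canned "Job Status" line is an assumed output format.

```shell
#!/bin/sh
# Sketch only: FAKE_JOB_STATUS and the canned output line are hypothetical
# test scaffolding, not part of DataStage.

# run_dsjob: call the real dsjob unless FAKE_JOB_STATUS is set, in which case
# print a canned "Job Status" line so the CRASHED branch can be exercised.
run_dsjob() {
    if [ -n "${FAKE_JOB_STATUS:-}" ]; then
        printf 'Job Status      : %s\n' "$FAKE_JOB_STATUS"
    else
        "$HOME_DS/bin/dsjob" "$@"
    fi
}

# choose_action: map the status text to what the script would do next.
choose_action() {
    case "$1" in
        *CRASHED*|*STOPPED*|*FAILED*) echo "restart" ;;
        *)                            echo "none" ;;
    esac
}

FAKE_JOB_STATUS='CRASHED (96)'
status_line=$(run_dsjob -jobinfo PROJECT SEQUENCE_NAME | grep '^Job Status')
action=$(choose_action "$status_line")
echo "simulated status line: $status_line"
echo "action: $action"
# A real script would now issue the restart command exactly once:
#   $HOME_DS/bin/dsjob -run -mode RESTART -wait -jobstatus PROJECT SEQUENCE_NAME
```

Because only the status lookup is faked, the sequence itself is not really in a CRASHED state when the restart is issued, so the behaviour seen in such a test may differ from a genuine crash.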

This restarts the sequence from the point of failure, but the catch is that once that run completes, the sequence is executed again automatically.

Now I am left wondering why the sequence is executed a second time.
Is it because of the RESTART mode?
If I change it to RESET instead of RESTART, the complete sequence would execute rather than resuming from the point of failure.

Any suggestions to make this work without issues?

Thanks
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

It shouldn't restart, complete, and then run all over again... unless perhaps (just a guess) multiple scripts or restart commands were issued at the same time with the -wait option. I would want to try to reproduce that behavior, because it sounds like an unusual problem.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Personally, I would never attempt to automate the restart of a STOPPED or CRASHED job. These are unexpected ways for a job to end.

I would require human intervention to perform analysis and take corrective action based on the analysis.

Mike
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I have to agree with Mike!
harikhk
Participant
Posts: 64
Joined: Tue Jun 04, 2013 11:36 am

Post by harikhk »

qt_ky wrote: It shouldn't restart, complete, and then run it all over again.... unless perhaps (just a guess) if there were multiple scripts or restart commands issued at the same time with the -wait option. I would want to try to reproduce that behavior, because it sounds like an unusual problem.
It is a single script, and only a single statement gets executed. The behaviour is really unusual.
harikhk
Participant
Posts: 64
Joined: Tue Jun 04, 2013 11:36 am

Post by harikhk »

Mike wrote: Personally, I would never attempt to automate the restart of a STOPPED or CRASHED job. These are unexpected ways for a job to end.

I would require human intervention to perform analysis and take corrective action based on the analysis.
I had to automate this as per the project requirement. Would you suggest any other option, or is manual intervention preferred?
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

How about automating detection of crashed or stopped and emailing an alert to say that intervention is required, for the reasons Mike mentioned?
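
A minimal sketch of that detect-and-alert idea, under stated assumptions: the status text would come from dsjob -jobinfo in a real script, ALERT_TO and SEND_MAIL are hypothetical settings, and mailx is assumed to be installed and configured on the host. By default the sketch only prints the alert.

```shell
#!/bin/sh
# Sketch only: ALERT_TO, SEND_MAIL, the message wording and the use of mailx
# are assumptions made for illustration.

ALERT_TO="${ALERT_TO:-ops-team@example.com}"   # hypothetical address

# is_abnormal: succeed when the status text names CRASHED or STOPPED.
is_abnormal() {
    case "$1" in
        *CRASHED*|*STOPPED*) return 0 ;;
        *)                   return 1 ;;
    esac
}

# build_alert: print the alert body for a project, job and status.
build_alert() {
    printf 'DataStage job needs manual intervention\n'
    printf 'Project: %s\nJob: %s\nStatus: %s\n' "$1" "$2" "$3"
}

# notify: print the alert; with SEND_MAIL=1 also pipe it to mailx.
notify() {
    body=$(build_alert "$1" "$2" "$3")
    printf '%s\n' "$body"
    if [ "${SEND_MAIL:-0}" = "1" ]; then
        printf '%s\n' "$body" | mailx -s "DataStage alert: $2 is $3" "$ALERT_TO"
    fi
}

status='CRASHED (96)'   # would come from dsjob -jobinfo in a real script
if is_abnormal "$status"; then
    notify PROJECT SEQUENCE_NAME "$status"
fi
```

The script deliberately stops at notification and leaves the reset-or-restart decision to whoever analyses the failure.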
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

My opinion is that your project requirement is dangerous.

Jobs are not expected to abort/stop/crash.

I want to know what unexpected event triggered the abnormal termination.

Was it preventable? If it is due to a development or design defect, then get that fixed so that it doesn't repeat.

Is there a resource constraint that needs to be addressed?

Has the root cause of the abnormal termination been eliminated?

Is the appropriate recovery action a reset/rerun or is it a restart?

As qt_ky suggests, automated notification is the way to go so that analysis and corrective action can take place ASAP.

Mike
harikhk
Participant
Posts: 64
Joined: Tue Jun 04, 2013 11:36 am

Post by harikhk »

Hi Mike,

The CRASHED status was caused by an unplanned restart of the server, not by a resource or design issue.

As suggested, I will implement manual intervention, triggered by an email alert or by monitoring.

The question I still have is why the job restarts automatically after completion.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

My questions weren't specific to your particular issue... rather they are questions that I ask any time that a job terminates abnormally.

Regarding your specific issue with the restart causing more than one run, I would carefully read through each job run's log to trace what was executed, what was skipped on restart, and what checkpoints were created. Perhaps you have activities that do not create a checkpoint. The "Summary of sequence run" log entry is particularly useful.

I wouldn't necessarily trust anything about a crashed job. The engine stopped abruptly, so all kinds of bad things could be possible.

Mike
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

You should create a generic script that accepts project, job and invocation id of a desired target. The script should execute a reset of the job.

That way you can run your regular external job scheduler which most often has an ON DEMAND ability. The reset would execute as your Production Batch ID thus having full reset capabilities within the target project.

So, it would be the best of both worlds. It would allow your operations folks to have a reset ability on any job, and also cover you in terms of not always doing it automatically in your scripts. As Mike said, aborts need research. But as an admin, I'd rather not get paged out in the middle of the night just to hit a reset button if the application team understands it and just wants a reset done.
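
One possible shape for such a generic reset script, as a sketch only. The function names and the DRY_RUN switch are hypothetical, the HOME_DS default is a placeholder path, and the sketch assumes that a multi-instance job is addressed as JOB.INVOCATION_ID on the dsjob command line. It defaults to printing the command instead of running it.

```shell
#!/bin/sh
# Sketch only: function names and the DRY_RUN switch are hypothetical.
# Assumes a multi-instance job is addressed as JOB.INVOCATION_ID.

# job_target: print JOB, or JOB.INVOCATION_ID when an id is given.
job_target() {
    if [ -n "${2:-}" ]; then
        printf '%s.%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# reset_job: reset PROJECT JOB [INVOCATION_ID].
# With DRY_RUN=1 (the default here) the command is printed, not executed.
reset_job() {
    if [ "$#" -lt 2 ]; then
        echo "usage: reset_job PROJECT JOB [INVOCATION_ID]" >&2
        return 2
    fi
    target=$(job_target "$2" "${3:-}")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$HOME_DS/bin/dsjob -run -mode RESET -wait $1 $target"
    else
        "$HOME_DS/bin/dsjob" -run -mode RESET -wait "$1" "$target"
    fi
}

DRY_RUN=1
HOME_DS="${HOME_DS:-/path/to/DSEngine}"   # placeholder path
reset_job PROJECT SEQUENCE_NAME INV01
```

An external scheduler could call a script like this on demand with the project, job and invocation id as arguments, running under the production batch ID.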