Detecting job restart within a Job Sequence

Posted: Tue Jun 02, 2009 8:52 pm
by GavMagill
Hi all.

I am putting together a restartable job sequence and would like a process at the start of the job that detects whether the job is restarting after an abort, so that I can write this information to an audit record from within the job. The DataStage Director can natively detect when a job sequence is restarting after an abort, and in the Job Control tab of a restartable Job Sequence the following line of code also seems to do restart detection.

"If DSCheckPointExists(DSJ.ME, cp$dtm) Then GoTo L$RESTART"

I have tried building a simple routine, called from a Routine Activity, to run this function, but it returns False when I pass the parent JS jobhandle during a restart after an abort. What I am wondering is: can I interrogate the return from the DSCheckPointExists function some other way from within the job, or is there some simpler method of detecting (from within the job) when a Job Sequence is being restarted?

I have searched through the Archives looking under Checkpoints as this seems to be the best key word to use for Restartable Job Sequences but have not been able to find a reference to someone trying to do something similar. I have also looked through the doco but have not found any functionality which will enable me to do this.

I thought this would be something relatively simple to do and I am hoping I have not missed something obvious that I should know about. If I have, I apologise in advance for wasting people's time, but I would certainly appreciate being pointed in the right direction. Otherwise, if anyone has a solution or suggestion, I would appreciate any assistance you can give me.

Thanks in advance.
Gavin

Posted: Tue Jun 02, 2009 11:36 pm
by ray.wurlod
Can you show us how you got the "parent JS jobhandle"? I'm assuming you've applied DSAttachJob() to the name returned by the DSJobController macro (or an equivalent call to DSGetJobInfo()).

The other way is to check for the existence of a CHECKPOINT record in the RT_STATUSnnn hashed file for the job. This will exist when in a restart situation and will not exist in a clean start situation. Of course, you need the job number from DS_JOBS for this approach.

Posted: Wed Jun 03, 2009 6:50 am
by chulett
I've looked at the same code and have very little idea what cp$dtm would be. That's the problem with going 'under the covers' and using undocumented internal functions not meant for us mere mortals. :wink:

Posted: Wed Jun 03, 2009 3:56 pm
by ray.wurlod
cp$dtm is just a variable name. You can find its assignment statement earlier in the job control code.

Posted: Wed Jun 03, 2009 4:12 pm
by chulett
I know... and no, I couldn't - hence my comment. It was referenced in the DSCheckPointExists() and DSLogInfo() calls but nowhere else. :?

Posted: Wed Jun 03, 2009 4:17 pm
by ray.wurlod
Check the included header files? EQUATE declaration? (I don't have any restartable sequences at the moment, so can't check.)

Posted: Wed Jun 03, 2009 4:29 pm
by chulett
It does a single EQU and an INCLUDE on DSJ_XFUNCS.H but there's not much to it. I haven't done an exhaustive search yet but haven't had any luck where I have looked. :(

Posted: Wed Jun 03, 2009 11:06 pm
by GavMagill
Hi Ray / Craig

Thanks for your input. I have managed to kludge together a routine which checks the RT_STATUS hashed file to see if the Checkpoint row has been set, and this is giving me the answer I need for the time being. Have to say my BASIC skills are pretty rusty, so it took longer than it should have to get it up and running, but it is working OK.

I went down the path of looking at whether I could find what sets the cp$dtm variable and, like Craig, I was not able to figure out where the EQUATE statement occurs, hence my original posting above to ask for help.

Using the RT_STATUS file feels a bit clunky and I will need to make sure it continues to work in later upgrades, but for the moment it is working for me, so I just wanted to say thank you for everyone's assistance.

Regards
Gavin

Posted: Wed Jun 03, 2009 11:31 pm
by chulett
I'd wager that STATUS hashed file is indeed the same thing the routine checks against. And if you'd like to post your routine code, we'd be glad to rip it to^H^H^H err, have a look at it and give you some pointers. :wink:

Unix approach.

Posted: Thu Jun 04, 2009 12:41 am
by vrishabhsagar
Since your server runs on Unix, may I suggest a UNIX approach? Something like a shell script that queries the dsjob -log and dsjob -logdetail and parses the previous run's log to determine the status of the previous run could satisfy your requirements.

I once did this (I wrote a shell function to find the name of the aborted job from the current sequence run; I can PM the code if you want to have a look). It was horribly gruesome and there were a LOT of issues, but I was able to manage the trick.

Posted: Thu Jun 04, 2009 1:11 am
by ray.wurlod
Nah. Horribly gruesome, as you admit. Much easier to check for the existence of the CHECKPOINT record.

There is added complexity for multi-instance jobs, which each have a CHECKPOINT record. But viewing the RT_STATUSnnn hashed file will reveal the naming convention.

The basis of the technique is:

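* Name of the current job (the sequence itself)
JobName = DSGetJobInfo(DSJ.ME, DSJ.JOBNAME)
* Look up the job number in field 5 of the DS_JOBS record
JobNumber = Trans("DS_JOBS", JobName, 5, "X")
* Build the status file name, stripping any spaces
StatusName = Convert(" ", "", "RT_STATUS" : JobNumber)
* Non-empty only if the CHECKPOINT record exists ("X" yields "" when it does not)
IsRestarting = (Trans(StatusName, JobName : ".CHECKPOINT", 0, "X") > "")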

Posted: Thu Jun 04, 2009 5:19 am
by GavMagill
I am sitting at home at the moment so don't have access to the exact source for the routine but it is basically along the lines of the following. Note the parameters passed are Arg1=JobNo and Arg2=JobName.

The routine is called in a UserVariables Activity right at the start of the Job Sequence being run. As the Job Sequence is restartable the Job Control logic always creates a CHECKPOINT record in the RT_STATUSnnn hash file before it processes any stages but the content of the fifth field is empty until a first checkpoint is processed (this is the case when it is run from a non-restart state). If the job fails and is then restarted the content of the fifth field in the checkpoint record will be more than zero bytes (hence the test for the length greater than 0). Note I realise I need to put error handling around the Open and the Read but for testing purposes this did the job.

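      RoutineName = 'TestIsJobRestarting'
      Ans = 'N'
      * Arg1 = job number, Arg2 = job name
      V_RTHashFileName = 'RT_STATUS' : Arg1
      K_HashFileKey = Arg2 : '.CHECKPOINT'
      Open V_RTHashFileName To F_RTStatusHashFile Then
         Read V.Record From F_RTStatusHashFile, K_HashFileKey Then
            * Field 5 is only populated once a checkpoint has been processed
            If Len(V.Record<5>) > 0 Then Ans = 'Y'
         End
         Close F_RTStatusHashFile
      End Else
         * Could not open the status file: log it and leave Ans as 'N'
         Call DSLogWarn('Unable to open ' : V_RTHashFileName, RoutineName)
      End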

Also, with regard to dropping out to a Unix script: I am trying very hard to avoid doing anything outside of the DS User Interface with this implementation. I am trying to keep the use of routines to an absolute minimum, but for some things there isn't much option. The idea is that other developers on the project hopefully won't need to go poking around behind the scenes to figure out what is going on when maintaining any of the jobs. (This of course may be just wishful thinking, but I am trying hard to make it so.)

Please feel free to comment on the above but be gentle, I am way outa practice. :wink:

PS: Ray, I am sure your code is way more efficient than mine, but I just need to work out exactly what it is doing first. (I haven't used the Trans command before.) :shock:

Posted: Mon Apr 09, 2012 11:45 pm
by Kryt0n
While a bit late in on this one, I've been looking into DSCheckPointExists and have to an extent figured it out. The cp$dtm is no more than a timestamp field that the function passes back; I can only guess it is the time the checkpoint was created. All you need to do is declare an empty variable, and then you can interrogate it afterwards.

Figured I'd post this just in case someone else wonders about it.
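
For anyone wanting to try this, here is a minimal, untested sketch of how the call might look in a routine invoked from the sequence. It is built only on what this thread describes: DSCheckPointExists() is an undocumented internal function, so the header it needs and the meaning of its second argument are assumptions, and CheckpointTime is just an illustrative variable name.

```
$INCLUDE DSINCLUDE JOBCONTROL.H
* Assumption: this is the header the generated job control code includes
$INCLUDE DSINCLUDE DSJ_XFUNCS.H

      RoutineName = 'TestIsJobRestarting'
      * Declare the variable empty; the call is assumed to pass back a timestamp
      CheckpointTime = ''
      If DSCheckPointExists(DSJ.ME, CheckpointTime) Then
         Ans = 'Y'
         Call DSLogInfo('Restart detected, checkpoint value: ' : CheckpointTime, RoutineName)
      End Else
         Ans = 'N'
      End
```

Because the function is internal, its behaviour may change between releases; checking the CHECKPOINT record in RT_STATUSnnn, as described earlier in the thread, is the alternative that does not depend on it.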