Job Sequence hangs at checkpointing

Posted: Thu Apr 15, 2010 8:00 am
by sinhasaurabh014
Hi,
I have many sequences that hang at certain points during the job run. Checkpointing is selected in the job properties so that the sequence is restartable on failure. A typical log from Director looks like this:

Code: Select all

   Item #: 1
   Message Id: IIS-DSTAGE-RUN-I-0070
   Message: Starting Job Seq_StgCMT_FACILITY_TYPE_T.

   Item #: 4
   Message Id: IIS-DSTAGE-RUN-I-0019
   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (@Coordinator): Starting new run of checkpointed Sequence job

   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (@Test_Source_File): Omitted checkpoint for call of routine 'DSWaitForFile'

   Message Id: IIS-DSTAGE-RUN-I-0034
   Message: Seq_StgCMT_FACILITY_TYPE_T -> (Stg_CMT_FACILITY_TYPE_T): Job run requested
Mode (row/warn limits) = 0/0

   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (DSRunJob): Waiting for job Stg_CMT_FACILITY_TYPE_T to start

  Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (DSWaitForJob): Waiting for job Stg_CMT_FACILITY_TYPE_T to finish

  Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (DSWaitForJob): Job Stg_CMT_FACILITY_TYPE_T has finished, status = 1 (Finished OK)

   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (@jsCMT_FACILITY_TYPE_T): Omitted checkpoint for run of job 'Stg_CMT_FACILITY_TYPE_T'

End of report.
Please note that I have deleted the unwanted messages from the log above (such as environment variables and parameters).

The problem is that many of my sequences often hang at the last point shown: after the "child server job" has run, the sequence has to either checkpoint it or omit the checkpoint. Soon after that, it has to call a routine that runs a multi-instance job to populate a control table.

Can somebody please tell me what could cause the job to hang?
I initially thought it might be because I am running many jobs at the same time, so the routine could not be invoked concurrently, or because of a database table lock. But I ran this particular job in isolation and it still got stuck.

Please advise

Posted: Thu Apr 15, 2010 8:58 am
by chowdhury99
Use Excep_ErrorHandling and Terminator_Activity stages in the sequence. If any exception occurs, it will stop the job.

Thanks

Posted: Mon Apr 19, 2010 3:47 am
by priyadarshikunal
Next time it hangs, go to Cleanup Resources and post the status of the main process shown in that window.

Job Resources details

Posted: Mon Apr 19, 2010 6:45 am
by sinhasaurabh014
This time it hung for a different job, and the entries from "Cleanup Resources" are:

Code: Select all

SSELECT RT_LOG381 WITH @ID LIKE '1N0N' AND (TYPE="1") COUNT.SUP DSR_LOG @0x3576
The "child server job" jobno is 381.

My child server job has, as usual, completed OK.
After the server job runs, I trigger a routine that goes through the log of that job run and fetches some log details, which are then used as parameters to run another multi-instance job.

What is the command in the code section above trying to do, and how can I fix this? Please advise.

Posted: Mon Apr 19, 2010 6:48 am
by chulett
Basically, it's just doing a sorted read of the job's log. Is it especially... large? Can you view it in its entirety via the Director? How long before you decided it was 'hung'?

Posted: Mon Apr 19, 2010 7:18 am
by sinhasaurabh014
The log is quite small. It does not go beyond 20 log entries for each sequence run, plus around 10 log entries for the child job activity within the sequence.
My child Job Activity always finishes in 3-5 seconds, but the sequence hangs. I have waited for more than 10 minutes.

Another thing: this time I set the sequence to be not restartable, i.e. I unchecked "Mark checkpoints so sequence is restartable on failure".

Posted: Mon Apr 19, 2010 2:00 pm
by ray.wurlod
How large is the log? Is the log corrupted?

Please advise what the sequence does. Everything it does.

Sequence and its routine...

Posted: Tue Apr 20, 2010 1:27 am
by sinhasaurabh014
I have automatic purge set for my project to retain only the last two job-run logs. Each sequence invocation, as well as the Job Activity in it, creates no more than 20 log entries.

What the Sequence does:
-----------------------------
It runs a server job and then triggers a routine, passing it the job name (stagename.$JobName) and a defined integer as parameters. This routine looks into the log of the 'job activity' the sequence just ran and fetches the row counts from the log.
(The job activity has only four stages--Sequential file (Source), Transformer, DB2(Target) and another Sequential file (Reject))

Next the routine runs a multi instance job to populate a control table.

Sometimes the sequence runs successfully; other times it hangs. When it hangs, I release the resources from Director and run the sequence again, and it then runs successfully.

Pasting below my routine code:

Code: Select all

$INCLUDE DSINCLUDE JOBCONTROL.H

      RowInput=0
      RowLoaded=0
      RowReject=0
      Status1='Failure'

      JobHandle = DSAttachJob (JobName, DSJ.ERRFATAL)

      Status=DSGetJobInfo (JobHandle, DSJ.JOBSTATUS)

      StatA=Status

      If (Status=DSJS.RUNOK or Status=DSJS.RUNWARN) Then

* The following worked in isolation but not from sequence, so commented

* RowInput=DSGetLinkInfo (JobHandle, "Source", "In", DSJ.LINKROWCOUNT)
* RowLoaded=DSGetLinkInfo (JobHandle, "Target", "Out", DSJ.LINKROWCOUNT)
* RowReject=DSGetLinkInfo (JobHandle, "Reject", "Rej", DSJ.LINKROWCOUNT)

* The following part of the code incorporates the same logic as being done by the above 3 lines i.e. fetching row counts
* It gets the counts from the log rather than from the job or link status
* In case the above logic works in all cases, the below following part of code (till the next comment) can be removed

         EventId = DSGetNewestLogId(JobHandle, DSJ.LOGINFO)
         Loop
            EventDetail = DSGetLogEntry(JobHandle, EventId)
         Until EventId<>0 And Index (EventDetail, "Stage statistics", 1)<>0 Do
            EventId = EventId-1
         Repeat

         If EventId <> 0 Then

            Event=''
            For i = 1 To Len(EventDetail)
               temp=Seq(EventDetail [i,1])
               If temp > 127 Then Event=Event:"#" Else Event=Event:EventDetail [i,1]
            Next i

            Status1='Success'

            For i = 1 To Count (Event, "#") + 1
               LogMsg=Field (Event, "#", i)
               Cnt=Field (LogMsg, " ", 1)
               If Index (LogMsg, " In", 1) <> 0 Then RowInput=Int(Cnt)
               If Index (LogMsg, " Out", 1) <> 0 Then RowLoaded=Int(Cnt)
               If Index (LogMsg, " Rej", 1) <> 0 Then RowReject=Int(Cnt)
            Next i

         End

* End of logic for fetching row counts

      End

      Errcode=DSDetachJob (JobHandle)

* Setup JobCobtrol, run it, wait for it to finish, and test for success

      hJob1 = DSAttachJob ("JobCobtrolTable.":JobId, DSJ.ERRFATAL)
      If NOT(hJob1) Then
         Call DSLogFatal("Failed to attach Control Job", "JobControl")
         Ans = 1
         Abort
      End

      Status = DSGetJobInfo (hJob1, DSJ.JOBSTATUS)
      If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
         ErrCode = DSRunJob(hJob1, DSJ.RUNRESET)
         ErrCode = DSWaitForJob (hJob1)
         ErrCode = DSDetachJob (hJob1)
         hJob1 = DSAttachJob ("JobCobtrolTable.":JobId, DSJ.ERRFATAL)
      End

      paramerr = DSSetParam (hJob1, "JobId", JobId)
      paramerr = DSSetParam (hJob1, "JobName", JobName)
      paramerr = DSSetParam (hJob1, "Status1", Status1)
      paramerr = DSSetParam (hJob1, "InRows", RowInput)
      paramerr = DSSetParam (hJob1, "OutRows", RowLoaded)
      paramerr = DSSetParam (hJob1, "RejRows", RowReject)

      ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hJob1)

      Status = DSGetJobInfo(hJob1, DSJ.JOBSTATUS)
      If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
* Fatal Error - No Return
         Call DSLogFatal("Control Job Failed", "JobControl")
         Ans = 1
         Abort
      End
      Ans=0

Re: Sequence and its routine...

Posted: Tue Apr 20, 2010 3:41 am
by priyadarshikunal
sinhasaurabh014 wrote:

Code: Select all

         EventId = DSGetNewestLogId(JobHandle, DSJ.LOGINFO)
         Loop
            EventDetail = DSGetLogEntry(JobHandle, EventId)
         Until EventId<>0 And Index (EventDetail, "Stage statistics", 1)<>0 Do
            EventId = EventId-1
         Repeat

         If EventId <> 0 Then
I am a bit worried about this section, especially the condition in the Until. Put a maximum number of iterations on the loop and see if it helps.

For example, define i=1 before the loop and, after EventId = EventId-1, put

Code: Select all

i=i+1
If i>100 then exit
and see if it helps.
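
A minimal sketch of that bounded loop, written against the routine posted above. It assumes the log is still read with the same DSGetNewestLogId and DSGetLogEntry calls. MaxTries, Tries and Found are hypothetical names introduced here, and the cap of 100 is arbitrary. As far as I can tell, the original Until condition needs both EventId<>0 and a match, so if no "Stage statistics" entry is ever found, EventId passes 0 and keeps going negative without the loop ending. Testing EventId > 0 before each read avoids that as well.

```basic
* Hypothetical names (not in the original routine): MaxTries, Tries, Found
      MaxTries = 100 ;* arbitrary cap on the number of log entries to scan
      Tries = 0
      Found = @FALSE
      EventId = DSGetNewestLogId(JobHandle, DSJ.LOGINFO)

* Stop when the ids run out, the cap is reached, or the entry is found
      Loop
      While EventId > 0 And Tries < MaxTries Do
         EventDetail = DSGetLogEntry(JobHandle, EventId)
         If Index(EventDetail, "Stage statistics", 1) <> 0 Then
            Found = @TRUE
            Exit
         End
         EventId = EventId - 1
         Tries = Tries + 1
      Repeat

      If Not(Found) Then
         Call DSLogWarn("No 'Stage statistics' entry found for ":JobName, "JobControl")
      End
```

With this shape, the later test would become If Found Then rather than If EventId <> 0 Then, and the row counts simply stay at 0 when nothing is found.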

Posted: Tue Apr 20, 2010 6:26 am
by chulett
Try turning off auto-purge for the jobs in question and see if that helps. There are some known 'issues' with auto-purge and MI jobs.