Job Sequence hangs at checkpointing

sinhasaurabh014
Participant
Posts: 86
Joined: Wed Apr 02, 2008 2:32 am
Location: Bangalore

Job Sequence hangs at checkpointing

Post by sinhasaurabh014 »

Hi
I have many sequences that hang at certain points during the job run. Checkpointing is selected in the job properties so that the sequence is restartable on failure. A typical log from Director looks like this:

Code:

   Item #: 1
   Message Id: IIS-DSTAGE-RUN-I-0070
   Message: Starting Job Seq_StgCMT_FACILITY_TYPE_T.

   Item #: 4
   Message Id: IIS-DSTAGE-RUN-I-0019
   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (@Coordinator): Starting new run of checkpointed Sequence job

   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (@Test_Source_File): Omitted checkpoint for call of routine 'DSWaitForFile'

   Message Id: IIS-DSTAGE-RUN-I-0034
   Message: Seq_StgCMT_FACILITY_TYPE_T -> (Stg_CMT_FACILITY_TYPE_T): Job run requested
Mode (row/warn limits) = 0/0

   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (DSRunJob): Waiting for job Stg_CMT_FACILITY_TYPE_T to start

  Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (DSWaitForJob): Waiting for job Stg_CMT_FACILITY_TYPE_T to finish

  Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (DSWaitForJob): Job Stg_CMT_FACILITY_TYPE_T has finished, status = 1 (Finished OK)

   Message: Seq_StgCMT_FACILITY_TYPE_T..JobControl (@jsCMT_FACILITY_TYPE_T): Omitted checkpoint for run of job 'Stg_CMT_FACILITY_TYPE_T'

End of report.
Please note that I have deleted the unwanted messages from the log above (like env variables, parameters stuff).

The problem is that my sequences often hang at the last point shown above: after the "child server job" has run and the sequence has either recorded or omitted the checkpoint. Right after that, the sequence has to call a routine that runs a multi-instance job to populate a control table.

Can somebody please tell me what could cause the job to hang?
I initially thought it might be because I am running many jobs at the same time, so the routine could not be invoked concurrently, or because of a database table lock. But I ran this particular job in isolation and it still got stuck.

Please advise
chowdhury99
Participant
Posts: 43
Joined: Thu May 29, 2008 8:41 pm

Post by chowdhury99 »

Use Excep_ErrorHandling and Terminator_Activity stages in the sequence. If any exception occurs, they will stop the job.

Thanks
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Next time it hangs, go to Cleanup Resources and post the status of the main process shown in that window.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
sinhasaurabh014
Participant
Posts: 86
Joined: Wed Apr 02, 2008 2:32 am
Location: Bangalore

Job Resources details

Post by sinhasaurabh014 »

This time it hung for some other job, and the entries from "Cleanup Resources" are:

Code:

SSELECT RT_LOG381 WITH @ID LIKE '1N0N' AND (TYPE="1") COUNT.SUP DSR_LOG @0x3576
The job number of the "child server job" is 381.

My child server job has, as usual, completed OK.
After the server job runs, I trigger a routine that goes through the log of the previous job run and fetches some log details, which are then used as parameters to run another multi-instance job.

What is the command in the code section above trying to do? How can I fix this? Please advise.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Basically, it's just doing a sorted read of the job's log. Is it especially... large? Can you view it in its entirety via the Director? How long before you decided it was 'hung'?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sinhasaurabh014
Participant
Posts: 86
Joined: Wed Apr 02, 2008 2:32 am
Location: Bangalore

Post by sinhasaurabh014 »

The log is quite small. It does not go beyond 20 log entries for each sequence run, and around 10 log entries for the child Job Activity within the sequence.
My child Job Activity always finishes in some 3-5 seconds, but the sequence hangs. I have waited for more than 10 minutes.

Another thing: this time I set the sequence property to be not restartable, i.e. I unchecked "Mark checkpoints so sequence is restartable on failure".
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

How large is the log? Is the log corrupted?

Please advise what the sequence does. Everything it does.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sinhasaurabh014
Participant
Posts: 86
Joined: Wed Apr 02, 2008 2:32 am
Location: Bangalore

Sequence and its routine...

Post by sinhasaurabh014 »

I have automatic purge set for my project to retain only the logs of the last two job runs. Each sequence invocation, as well as the Job Activity in it, creates no more than 20 log entries.

What the Sequence does:
-----------------------------
It runs a server job and then triggers a routine; the parameters to the routine are the job name (stagename.$JobName) and a defined integer. This routine looks into the log of the Job Activity the sequence just ran and fetches the row counts from it.
(The job behind the Job Activity has only four stages: Sequential File (Source), Transformer, DB2 (Target) and another Sequential File (Reject).)

Next, the routine runs a multi-instance job to populate a control table.

Sometimes the sequence runs successfully; other times it hangs. When it hangs, I release the resources from Director and run the sequence again, and it runs successfully.

Pasting below my routine code:

Code:

$INCLUDE DSINCLUDE JOBCONTROL.H

      RowInput=0
      RowLoaded=0
      RowReject=0
      Status1='Failure'

      JobHandle = DSAttachJob (JobName, DSJ.ERRFATAL)

      Status=DSGetJobInfo (JobHandle, DSJ.JOBSTATUS)

      StatA=Status

      If (Status=DSJS.RUNOK or Status=DSJS.RUNWARN) Then

* The following worked in isolation but not from sequence, so commented

* RowInput=DSGetLinkInfo (JobHandle, "Source", "In", DSJ.LINKROWCOUNT)
* RowLoaded=DSGetLinkInfo (JobHandle, "Target", "Out", DSJ.LINKROWCOUNT)
* RowReject=DSGetLinkInfo (JobHandle, "Reject", "Rej", DSJ.LINKROWCOUNT)

* The following part of the code incorporates the same logic as being done by the above 3 lines i.e. fetching row counts
* It gets the counts from the log rather than from the job or link status
* In case the above logic works in all cases, the below following part of code (till the next comment) can be removed

         EventId = DSGetNewestLogId(JobHandle, DSJ.LOGINFO)
         Loop
            EventDetail = DSGetLogEntry(JobHandle, EventId)
         Until EventId<>0 And Index (EventDetail, "Stage statistics", 1)<>0 Do
            EventId = EventId-1
         Repeat

         If EventId <> 0 Then

            Event=''
            For i = 1 To Len(EventDetail)
               temp=Seq(EventDetail [i,1])
               If temp > 127 Then Event=Event:"#" Else Event=Event:EventDetail [i,1]
            Next i

            Status1='Success'

            For i = 1 To Count (Event, "#") + 1
               LogMsg=Field (Event, "#", i)
               Cnt=Field (LogMsg, " ", 1)
               If Index (LogMsg, " In", 1) <> 0 Then RowInput=Int(Cnt)
               If Index (LogMsg, " Out", 1) <> 0 Then RowLoaded=Int(Cnt)
               If Index (LogMsg, " Rej", 1) <> 0 Then RowReject=Int(Cnt)
            Next i

         End

* End of logic for fetching row counts

      End

      Errcode=DSDetachJob (JobHandle)

* Setup JobCobtrol, run it, wait for it to finish, and test for success

      hJob1 = DSAttachJob ("JobCobtrolTable.":JobId, DSJ.ERRFATAL)
      If NOT(hJob1) Then
         Call DSLogFatal("Failed to attach Control Job", "JobControl")
         Ans = 1
         Abort
      End

      Status = DSGetJobInfo (hJob1, DSJ.JOBSTATUS)
      If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
         ErrCode = DSRunJob(hJob1, DSJ.RUNRESET)
         ErrCode = DSWaitForJob (hJob1)
         ErrCode = DSDetachJob (hjob1)
         hJob1 = DSAttachJob ("JobCobtrolTable.":JobId, DSJ.ERRFATAL)
      End

      paramerr = DSSetParam (hJob1, "JobId", JobId)
      paramerr = DSSetParam (hJob1, "JobName", JobName)
      paramerr = DSSetParam (hJob1, "Status1", Status1)
      paramerr = DSSetParam (hJob1, "InRows", RowInput)
      paramerr = DSSetParam (hJob1, "OutRows", RowLoaded)
      paramerr = DSSetParam (hJob1, "RejRows", RowReject)

      ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hJob1)

      Status = DSGetJobInfo(hJob1, DSJ.JOBSTATUS)
      If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
* Fatal Error - No Return
         Call DSLogFatal("Control Job Failed", "JobControl")
         Ans = 1
         Abort
      End
      Ans=0
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Re: Sequence and its routine...

Post by priyadarshikunal »

sinhasaurabh014 wrote:

Code:

         EventId = DSGetNewestLogId(JobHandle, DSJ.LOGINFO)
         Loop
            EventDetail = DSGetLogEntry(JobHandle, EventId)
         Until EventId<>0 And Index (EventDetail, "Stage statistics", 1)<>0 Do
            EventId = EventId-1
         Repeat

         If EventId <> 0 Then
I am a bit worried about this section, especially the condition in the Until. Put a maximum number of iterations on the loop and see if it helps.

For example, define i=1 before the loop, and after EventId = EventId-1 put

Code:

i=i+1
If i>100 then exit
and see if it helps.
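
To spell out the concern: the quoted loop only stops once it finds an entry containing "Stage statistics". If that entry has been purged or was never written, EventId just keeps counting down past zero and the loop never ends. Whether that is what actually happens here is not confirmed, but it would look like a hang. Below is a minimal sketch of a bounded version, assuming a missing or unreadable entry simply fails the Index test. The names MaxSteps, Steps and Found, and the limit of 100, are hypothetical and not part of the original routine.

```basic
* Sketch only: bounded backwards search for the "Stage statistics" entry.
* MaxSteps, Steps and Found are hypothetical names; 100 is an arbitrary cap.
      MaxSteps = 100
      Steps = 0
      Found = @FALSE
      EventDetail = ''

      EventId = DSGetNewestLogId(JobHandle, DSJ.LOGINFO)

      Loop
* Stop on a hit, when the ids run out (or an error code came back), or at the cap
      While EventId > 0 And Steps < MaxSteps And Not(Found) Do
         EventDetail = DSGetLogEntry(JobHandle, EventId)
         If Index(EventDetail, "Stage statistics", 1) <> 0 Then
            Found = @TRUE
         End Else
            EventId = EventId - 1
            Steps = Steps + 1
         End
      Repeat

* Keep the existing "If EventId <> 0 Then" test downstream meaningful
      If Not(Found) Then EventId = 0
```

With this shape the routine falls through with EventId = 0 and the default row counts when no matching entry is found, instead of looping forever.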
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Try turning off the auto-purge for the jobs in question, see if that helps. There are some known 'issues' with auto-purge and MI jobs.
-craig

"You can never have too many knives" -- Logan Nine Fingers