
DS Job is not completing.

Posted: Wed Jan 02, 2013 3:33 pm
by rsripathy
The DS job design is: SEQfile -> MdfyStg -> TrnasfrStg -> FunlStg -> SQLEnterpriseStg.

Functionality: the job reads one file at a time and loads it into the target table.
I maintain a separate table called AuditTable, which is handled by BeforeJob & AfterJob subroutines (it collects the file count, file name, number of rows loaded, target table, any warning messages, etc.).
The job runs 24/7 (driven by a Sequence job) and has been running fine for the past 5 months, but recently it hangs in the middle at the point below...

Environment variable settings: (..)
Parallel job initiated
main_program: IBM InfoSphere DataStage Enterprise Edition 8.7.0.6118 (..)
main_program: conductor uname: -s=Windows_NT; -r=1; -v=6; -n=HODSENGD01; -m=Pentium
main_program: orchgeneral: loaded (orchsort: loaded)
sqlTgtAMBSwrt: Process meta data not available in database
main_program: APT configuration file: C:/IBM/InformationServer/Server/Configurations/default.apt
(No further processing from this point...)


Please help.

Re: DS Job is not completing.

Posted: Wed Jan 02, 2013 5:34 pm
by SURA
I have hit this issue a good number of times but was never able to find the cause. It happens to specific job(s); raising a PMR is the best way to resolve it.

The problem is that your day-to-day production runs will not complete successfully because of this. Better to raise a PMR.

Posted: Thu Jan 03, 2013 5:09 am
by ArndW
Could you put some CALL DSLogInfo() calls into your before-job routine, particularly one right before the RETURN statement? That will confirm whether the hang is in the DataStage job itself rather than in the before-job code.
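As a minimal sketch of what that instrumentation might look like (the routine name, messages, and audit logic here are illustrative assumptions, not taken from rsripathy's actual routine):

```
      Subroutine BeforeJobAudit(InputArg, ErrorCode)
* Hypothetical before-job subroutine; only the DSLogInfo calls matter here.
* ErrorCode = 0 tells DataStage the routine completed successfully.
      ErrorCode = 0

      Call DSLogInfo("Entering before-job routine", "BeforeJobAudit")

* ... audit-table lookup and insert would go here ...

      Call DSLogInfo("About to RETURN from before-job routine", "BeforeJobAudit")
      RETURN
```

If the "About to RETURN" message appears in the job log but the job still hangs, the before-job code can be ruled out.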

Posted: Thu Jan 03, 2013 7:57 am
by rsripathy
Thanks ArndW & SURA...

ArndW:
I am already using DSLogInfo() -- "Record inserted successfully into Job Audit Table"; this statement should appear in the DS log if the record is inserted successfully. If no return code comes back, the routine goes to its ErrorExit section.
To get more logging, I set $APT_STARTUP_STATUS = True.
Here is the full log...

Starting Job djpXTMR_AMBSLoad.
djpXTMR_AMBSLoad..BeforeJob (JOBNAME): The job name is :djpXTMR_AMBSLoad
djpXTMR_AMBSLoad..BeforeJob (USERNAME): The user id running this job is :hsbc5423
djpXTMR_AMBSLoad..BeforeJob (BeforeJobStats): " At this point getting last max(time) of the job in Audit Table "
djpXTMR_AMBSLoad..BeforeJob (GetLatestTime): The return value of query is :
--------------------------------------
2013-01-02 10:32:49.1000000
(1 rows affected)
djpXTMR_AMBSLoad..BeforeJob (GetLastStatus): Got the status of last job run successfully from AUDIT_CREDIT_SYSTEMS table
djpXTMR_AMBSLoad..BeforeJob (BeforeJobStats): " At this point insert statement into Audit table "
djpXTMR_AMBSLoad..BeforeJob (BeforeJobStats): Record inserted successfully into Job Audit Table
Environment variable settings: (..)
Parallel job initiated
main_program: IBM InfoSphere DataStage Enterprise Edition 8.7.0.6118 (..)
main_program: orchgeneral: loaded(orchsort: loaded, orchstats: loaded)
sqlTgtXTMR_AMBSwrt: Process meta data not available in database
main_program: APT configuration file: D:/IBM/InformationServer/Server/Configurations/default.apt
main_program: Max inbound connections per node = 12, Max outbound connections per node = 12
main_program: Defining Section Leaders.
main_program: Contacting Section Leaders.
main_program: Broadcasting score.
main_program: Score (116,767 bytes) sent.
main_program: Starting Players.
main_program: Waiting for Players to start.
main_program: Setting up data connections among Players.
main_program: Starting step execution.
main_program: Waiting for step completion.
(No further processing from this point...)

Posted: Thu Jan 03, 2013 12:14 pm
by ArndW
Odd indeed. I can't check right now, but there are a number of additional logging parameters you could set, including APT_DUMP_SCORE, plus variables named something like 'report player calls' and 'report process ids', etc. Perhaps that will help you narrow down exactly which call from which process is the last to be executed.
On a related note, do you have "osh.exe" processes on the DataStage server when this happens?
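Since the conductor log above shows a Windows_NT host, one way to answer that question (my suggestion, not something ArndW specified) is to filter the process list from a command prompt on the engine host:

```
REM List any osh.exe processes on the DataStage engine host (Windows)
tasklist /FI "IMAGENAME eq osh.exe"
```

If the hang leaves orphaned player processes behind, they will show up here even after the job appears stalled in the Director log.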

Posted: Mon Jan 07, 2013 8:08 am
by rsripathy
Thanks ArndW,

I added APT_DUMP_SCORE, but I was not able to locate the 'report player calls' or 'report process ids' variables you mentioned. How do I check for "osh.exe" processes on the DS server?

Thanks...

Posted: Mon Jan 07, 2013 8:16 am
by ArndW
$APT_PM_SHOW_PIDS, $OSH_DUMP, and $OSH_ECHO can be used. Basically, all of the entries in the "reporting" category of the parameter list can be turned on to see if they help narrow down the error.
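As a sketch, the reporting variables mentioned in this thread would be added to the job (or project) parameters roughly like this; exact names and availability can vary by version, so treat this as an assumption to verify in the Administrator client:

```
APT_DUMP_SCORE=True
APT_STARTUP_STATUS=True
APT_PM_SHOW_PIDS=True
OSH_DUMP=True
OSH_ECHO=True
```

Each one adds detail to the job log at startup, which is useful precisely because this job stops logging after "Waiting for step completion."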

Posted: Mon Jan 07, 2013 10:17 am
by rsripathy
I added the APT variables; here is the resulting score dump:

main_program: This step has 6 datasets:
ds0: {op0[1p] (sequential fsqSrcAMBSsel)
eAny<>eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:{natural=mdf, synthetic="modify(0)"})}
ds1: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV47S7_djpXTMR_AMBSLoad_tfpAMBS in tfpAMBS)
eAny=>eCollectAny
op2[2p] (parallel buffer(1))}
ds2: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV47S7_djpXTMR_AMBSLoad_tfpAMBS in tfpAMBS)
eAny=>eCollectAny
op3[2p] (parallel buffer(0))}
ds3: {op2[2p] (parallel buffer(1))
eSame=>eCollectAny
op4[2p] (parallel fnlAMBS)}
ds4: {op3[2p] (parallel buffer(0))
eSame=>eCollectAny
op4[2p] (parallel fnlAMBS)}
ds5: {op4[2p] (parallel fnlAMBS)
eAny=>eCollectAny
op5[2p] (parallel APT_ODBCUpsertOperator in odbTgtXTMR_AMBSwrt)}
It has 6 operators:
op0[1p] {(sequential fsqSrcAMBSsel)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
({natural=mdf, synthetic="modify(0)"})
(APT_TransformOperatorImplV47S7_djpXTMR_AMBSLoad_tfpAMBS in tfpAMBS)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[2p] {(parallel buffer(1))
on nodes (
node1[op2,p0]
node2[op2,p1]
)}
op3[2p] {(parallel buffer(0))
on nodes (
node1[op3,p0]
node2[op3,p1]
)}
op4[2p] {(parallel fnlAMBS)
on nodes (
node1[op4,p0]
node2[op4,p1]
)}
op5[2p] {(parallel APT_ODBCUpsertOperator in odbTgtXTMR_AMBSwrt)
on nodes (
node1[op5,p0]
node2[op5,p1]
)}
It runs 11 processes on 2 nodes.

main_program: APT_PMsectionLeader: new Section Leader running, PID = 39636, note that this PID may be transient, refer to TFPM-00201 for the final Section Leader PID.
APT_PMsectionLeader: new Section Leader running, PID = 44888, note that this PID may be transient, refer to TFPM-00201 for the final Section Leader PID.

fnlAMBS,1: APT_PMPlayer: new Player running, PID = 38704, spawned from Section Leader, PID = 44888
odbTgtXTMR_AMBSwrt,1: APT_PMPlayer: new Player running, PID = 51280, spawned from Section Leader, PID = 44888
buffer(0),1: APT_PMPlayer: new Player running, PID = 47472, spawned from Section Leader, PID = 44888
buffer(1),1: APT_PMPlayer: new Player running, PID = 55416, spawned from Section Leader, PID = 44888
APT_CombinedOperatorController,1: APT_PMPlayer: new Player running, PID = 32700, spawned from Section Leader, PID = 44888
fsqSrcAMBSsel,0: APT_PMPlayer: new Player running, PID = 10988, spawned from Section Leader, PID = 39636
fnlAMBS,0: APT_PMPlayer: new Player running, PID = 23728, spawned from Section Leader, PID = 39636
odbTgtXTMR_AMBSwrt,0: APT_PMPlayer: new Player running, PID = 53752, spawned from Section Leader, PID = 39636
buffer(0),0: APT_PMPlayer: new Player running, PID = 56252, spawned from Section Leader, PID = 39636
buffer(1),0: APT_PMPlayer: new Player running, PID = 18332, spawned from Section Leader, PID = 39636
(No further processing from this point...)

Posted: Mon Jan 07, 2013 10:23 am
by ShaneMuir
Are you able to check for table/record locks on your target tables? I have had this problem a couple of times, and each time it was a table and/or record lock.

Posted: Mon Jan 07, 2013 10:41 am
by rsripathy
Yes, we checked on the database side as well; there are no locks on the target table.

Posted: Mon Jan 07, 2013 2:17 pm
by zulfi123786
I have seen cases where operator combination causes a job to hang; try running with APT_DISABLE_COMBINATION=True.

Posted: Mon Jan 07, 2013 2:29 pm
by rsripathy
I added it and tested, but no luck: the job got stuck again. It is waiting for something that I am not able to figure out.

If I remove the BeforeJob/AfterJob subroutines, the job works. But if the problem were in the BeforeJob routine, the logs already make it clear that control returned from the routine and the job was initiated... yet it still got stuck.

Posted: Mon Jan 07, 2013 3:00 pm
by zulfi123786
If the Transformer is replaced with a Copy stage and the database target with a Data Set, does the job still behave the same? Also track the player process PIDs to see whether they all keep running on both nodes or whether any of them finish.

Posted: Mon Jan 07, 2013 3:37 pm
by rsripathy
Thanks Zulfi...

Regarding a possible problem with the target database: I tested the job without the before/after routines, and it successfully loaded the records into the target table. So do you think this points more to the target table?
Moreover, I used SHOW_PIDS, and all processes were running on both nodes: in the log, for each stage, the (stagename,0): APT_PMPlayer and (stagename,1): APT_PMPlayer PIDs were running. Is this the correct way to check whether all processes are running, or is there another way?

Posted: Mon Jan 07, 2013 10:03 pm
by zulfi123786
The log shows that the PIDs are created and then stops updating. So take the player PIDs, go to the node where they are running, and run the ps command to see whether all those PIDs keep running or whether the players finish after execution; then check whether the section leaders finish too.
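A sketch of that check on a Unix node, using player PIDs taken from the score dump above (on the Windows conductor in this thread, tasklist /FI "PID eq <pid>" would be the equivalent; treat both forms as illustrations):

```
# Show only those player PIDs that are still alive; finished players drop out of the list
ps -p 38704,23728,53752
```

A PID that disappears from this output while the job still shows as running tells you which operator's player exited first.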

It is basically a trial-and-error approach to narrow down the root cause of the issue, and even if the cause is found, the solution may still end with a PMR.