DS Job in not completing.

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rsripathy
Participant
Posts: 21
Joined: Wed Sep 19, 2007 8:46 am

DS Job in not completing.

Post by rsripathy »

DS Job Design is SEQfile -> MdfyStg -> TrnasfrStg -> FunlStg -> SQLEnterpriseStg.

Functionality:- The job reads 1 file at a time and loads into target table.
I am maintaining a different table called AuditTable , which is handled by BeforeJob & AfterJob SubRoutines ( Collects, file count, file name, no of rows loaded , tgt table, any warning message etc ).
This job is 24/7 ( handled by a Sequence job.) The job is running fine from past 5 months. But recently it is getting hanged in middle at below point...

Environment variable settings: (..)
Parallel job initiated
main_program: IBM InfoSphere DataStage Enterprise Edition 8.7.0.6118 (..)
main_program: conductor uname: -s=Windows_NT; -r=1; -v=6; -n=HODSENGD01; -m=Pentium
main_program: orchgeneral: loaded (orchsort: loaded)
sqlTgtAMBSwrt: Process meta data not available in database
main_program: APT configuration file: C:/IBM/InformationServer/Server/Configurations/default.apt
(not processing further )


Please help.
SURA
Premium Member
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Re: DS Job in not completing.

Post by SURA »

I had this issue good number of times, but not able to find the cause. This will happen to some specific job(s), you need to raise a PMR is the best way to resolve this issue.

The problem is, your day to day prod run will not complete successfully due to this reason. Better to raise PMR.
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Could you put some CALL DSLogInfo() calls into your before-job routine, particularly one right before the RETURN statement? That will ensure that the cause of the hang is in the DataStage job and not in the before-job code.
rsripathy
Participant
Posts: 21
Joined: Wed Sep 19, 2007 8:46 am

Post by rsripathy »

Thanks ArndW & SURA...

ArndW:
I am using the DSLogInfo() -- "Record inserted successfully into Job Audit Table"; In fact this statemtent should pop out to DS log if record inserted successfully. If no return code from then go to ErrorExit section in routine.
To check more log, I used $APT_STARTUP_STATUS = True;
Here is the full log...

Starting Job djpXTMR_AMBSLoad.
djpXTMR_AMBSLoad..BeforeJob (JOBNAME): The job name is :djpXTMR_AMBSLoad
djpXTMR_AMBSLoad..BeforeJob (USERNAME): The user id running this job is :hsbc5423
djpXTMR_AMBSLoad..BeforeJob (BeforeJobStats): " At this point getting last max(time) of the job in Audit Table "
djpXTMR_AMBSLoad..BeforeJob (GetLatestTime): The return value of query is :
--------------------------------------
2013-01-02 10:32:49.1000000
(1 rows affected)
djpXTMR_AMBSLoad..BeforeJob (GetLastStatus): Got the status of last job run successfully from AUDIT_CREDIT_SYSTEMS table
djpXTMR_AMBSLoad..BeforeJob (BeforeJobStats): " At this point insert statement into Audit table "
djpXTMR_AMBSLoad..BeforeJob (BeforeJobStats): Record inserted successfully into Job Audit Table
Environment variable settings: (..)
Parallel job initiated
main_program: IBM InfoSphere DataStage Enterprise Edition 8.7.0.6118 (..)
main_program: orchgeneral: loaded(orchsort: loaded, orchstats: loaded)
sqlTgtXTMR_AMBSwrt: Process meta data not available in database
main_program: APT configuration file: D:/IBM/InformationServer/Server/Configurations/default.apt
main_program: Max inbound connections per node = 12, Max outbound connections per node = 12
main_program: Defining Section Leaders.
main_program: Contacting Section Leaders.
main_program: Broadcasting score.
main_program: Score (116,767 bytes) sent.
main_program: Starting Players.
main_program: Waiting for Players to start.
main_program: Setting up data connections among Players.
main_program: Starting step execution.
main_program: Waiting for step completion.
(No further processing from this point...)
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Odd indeed. I can't check now, but there are a number of additional logging parameters you could set including APT_DUMP_SCORE, and then variables something like 'report player calls' and 'report process ids', etc. Perhaps that might help you narrow down exactly which call from which process is the last to be executed.
On a similar note, do you have "osh.exe" processes on the datastage server when this happens?
rsripathy
Participant
Posts: 21
Joined: Wed Sep 19, 2007 8:46 am

Post by rsripathy »

Thanks ArndW,

I added APT_DUMP_SCORE, but I could not able to locate the 'report player calls' or 'report process ids' which you mentioned. How to check for the processes on ds server on "osh.exe" ?

Thanks...
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

$APT_PM_SHOW_PIDS, $OSH_DUMP, $OSH_ECHO can be used. Basically, all of the entries in the "reporting" category of the parameter list can be turned to to see if it helps narrow down the error.
rsripathy
Participant
Posts: 21
Joined: Wed Sep 19, 2007 8:46 am

Post by rsripathy »

Added the APT variables;

main_program: This step has 6 datasets:
ds0: {op0[1p] (sequential fsqSrcAMBSsel)
eAny<>eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:{natural=mdf, synthetic="modify(0)"})}
ds1: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV47S7_djpXTMR_AMBSLoad_tfpAMBS in tfpAMBS)
eAny=>eCollectAny
op2[2p] (parallel buffer(1))}
ds2: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV47S7_djpXTMR_AMBSLoad_tfpAMBS in tfpAMBS)
eAny=>eCollectAny
op3[2p] (parallel buffer(0))}
ds3: {op2[2p] (parallel buffer(1))
eSame=>eCollectAny
op4[2p] (parallel fnlAMBS)}
ds4: {op3[2p] (parallel buffer(0))
eSame=>eCollectAny
op4[2p] (parallel fnlAMBS)}
ds5: {op4[2p] (parallel fnlAMBS)
eAny=>eCollectAny
op5[2p] (parallel APT_ODBCUpsertOperator in odbTgtXTMR_AMBSwrt)}
It has 6 operators:
op0[1p] {(sequential fsqSrcAMBSsel)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
({natural=mdf, synthetic="modify(0)"})
(APT_TransformOperatorImplV47S7_djpXTMR_AMBSLoad_tfpAMBS in tfpAMBS)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[2p] {(parallel buffer(1))
on nodes (
node1[op2,p0]
node2[op2,p1]
)}
op3[2p] {(parallel buffer(0))
on nodes (
node1[op3,p0]
node2[op3,p1]
)}
op4[2p] {(parallel fnlAMBS)
on nodes (
node1[op4,p0]
node2[op4,p1]
)}
op5[2p] {(parallel APT_ODBCUpsertOperator in odbTgtXTMR_AMBSwrt)
on nodes (
node1[op5,p0]
node2[op5,p1]
)}
It runs 11 processes on 2 nodes.

main_program: APT_PMsectionLeader: new Section Leader running, PID = 39636, note that this PID may be transient, refer to TFPM-00201 for the final Section Leader PID.
APT_PMsectionLeader: new Section Leader running, PID = 44888, note that this PID may be transient, refer to TFPM-00201 for the final Section Leader PID.

fnlAMBS,1: APT_PMPlayer: new Player running, PID = 38704, spawned from Section Leader, PID = 44888
odbTgtXTMR_AMBSwrt,1: APT_PMPlayer: new Player running, PID = 51280, spawned from Section Leader, PID = 44888
buffer(0),1: APT_PMPlayer: new Player running, PID = 47472, spawned from Section Leader, PID = 44888
buffer(1),1: APT_PMPlayer: new Player running, PID = 55416, spawned from Section Leader, PID = 44888
APT_CombinedOperatorController,1: APT_PMPlayer: new Player running, PID = 32700, spawned from Section Leader, PID = 44888
fsqSrcAMBSsel,0: APT_PMPlayer: new Player running, PID = 10988, spawned from Section Leader, PID = 39636
fnlAMBS,0: APT_PMPlayer: new Player running, PID = 23728, spawned from Section Leader, PID = 39636
odbTgtXTMR_AMBSwrt,0: APT_PMPlayer: new Player running, PID = 53752, spawned from Section Leader, PID = 39636
buffer(0),0: APT_PMPlayer: new Player running, PID = 56252, spawned from Section Leader, PID = 39636
buffer(1),0: APT_PMPlayer: new Player running, PID = 18332, spawned from Section Leader, PID = 39636
(No further processing from this point...)
ShaneMuir
Premium Member
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Are you able to check for table/record locks on your target tables? I had this problem a couple of times and each time it was that the table and/or record lock.
rsripathy
Participant
Posts: 21
Joined: Wed Sep 19, 2007 8:46 am

Post by rsripathy »

Yes, we did check on database side as well. There are no locks on target table.
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

I have seen cases where operator combination causes job to hang, try running with APT_DISABLE_COMBINATION=TRUE
- Zulfi
rsripathy
Participant
Posts: 21
Joined: Wed Sep 19, 2007 8:46 am

Post by rsripathy »

I added it and tested.. but no luck... :( it got again stucked ... it is waiting for some thing.. which I am not able to figure out.....

If I removed the BeforeJob/AfterJob Sub-Routines, then it is working. If you say, that there is problem with BeforeJob Routine,then, it was clear from logs.. that control returned from routine, and the job got initiated.. well.. but stucked....
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

If transformer is replaced with copy and database with dataset would the job still behave the same ? Also track the player processes pid's to see if all of them on both nodes keep running or if any finish
- Zulfi
rsripathy
Participant
Posts: 21
Joined: Wed Sep 19, 2007 8:46 am

Post by rsripathy »

Thanks Zulfi...

If you say that there might be problem with target db, i tested the job even without before/after routines, the job successfully loaded the records into the target table.
So do you think it more concerns over target table ?
More over, I used the SHOW_PIDS.. and all processes were running on both nodes... if I see the log, for each of the stage, (stagename,0):APT_PMPlayer and (stagename,1):APT_PMPlayer, pids.. were running.. Is this correct way to check if all processes were running or is there any other way to check ??
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

The log shows that pid's are created and then the log doesn't get updated so you can take the player pid's and go to the node where they are running and run ps command to see if all those pid's keep running or did the player pid's finish after execution and then check if the section leader finish too.

Its basically a trial and error approach to narrow down to the root cause as to what is causing the issue and even if its found the solution of which might end with PMR
- Zulfi
Post Reply