Job hangs while running

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
peterxu
Participant
Posts: 8
Joined: Fri Sep 22, 2006 2:10 am

Job hangs while running

Post by peterxu »

Hi, I got trapped in a weird problem. Our ds jobs are deployed in 5 nodes, 1 conductor node and 4 players and all the jobs base on the same design template. After the conduct node is upgraded from P550 to P570, some jobs are randomly hanged there during the running process. And the problem jobs are randomly chosen. No particular ones. I'm totally confused by this. So could somebody help me out here? Thanks. Following is a detail tracing log for a hanging job.

Code: Select all

##I TFCN 000001 00:14:53(000) <main_program> 
IBM DataStage(tm) Parallel Extender 7.5.3
Copyright (c) 2004, 1997-2004 Ascential Software Corporation.
All Rights Reserved


##I TOSH 000002 00:14:53(001) <main_program> orchgeneral: loaded
##I TOSH 000002 00:14:53(002) <main_program> orchsort: loaded
##I TOSH 000002 00:14:53(003) <main_program> orchstats: loaded
##I TFSC 000001 00:14:53(004) <main_program> APT configuration file: /home/dsadm/Ascential/DataStage/Configurations/LP5.apt
##I TFPM 000324 00:14:54(000) <APT_LicenseCheckOp in APT_LicenseOperator,0> Calling runLocally: step=0, node=node3, op=1, ptn=0
##I TFPM 000324 00:14:54(000) <APT_LicenseCountOp in APT_LicenseOperator,0> Calling runLocally: step=0, node=node3, op=0, ptn=0
##I TFPM 000324 00:14:54(000) <APT_LicenseCountOp in APT_LicenseOperator,1> Calling runLocally: step=0, node=node4, op=0, ptn=1
##I TFOP 000094 00:14:54(001) <APT_LicenseCountOp in APT_LicenseOperator,1> Output 0 produced 1 records.
##I TFPM 000325 00:14:54(002) <APT_LicenseCountOp in APT_LicenseOperator,1> Operator completed. status: APT_StatusOk  elapsed: 0.00  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:14:54(003) <APT_LicenseCountOp in APT_LicenseOperator,1> Heap growth during runLocally(): 196624 bytes
##I TFPM 000324 00:14:54(000) <APT_LicenseCountOp in APT_LicenseOperator,2> Calling runLocally: step=0, node=node6, op=0, ptn=2
##I TFOP 000094 00:14:54(001) <APT_LicenseCountOp in APT_LicenseOperator,2> Output 0 produced 1 records.
##I TFPM 000325 00:14:54(002) <APT_LicenseCountOp in APT_LicenseOperator,2> Operator completed. status: APT_StatusOk  elapsed: 0.00  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:14:54(003) <APT_LicenseCountOp in APT_LicenseOperator,2> Heap growth during runLocally(): 196624 bytes
##I TLIC 000163 00:14:54(001) <APT_LicenseCheckOp in APT_LicenseOperator,0> Input 0 consumed 3 records.
##I TFOP 000094 00:14:54(001) <APT_LicenseCountOp in APT_LicenseOperator,0> Output 0 produced 1 records.
##I TFPM 000325 00:14:54(002) <APT_LicenseCountOp in APT_LicenseOperator,0> Operator completed. status: APT_StatusOk  elapsed: 0.00  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000325 00:14:54(002) <APT_LicenseCheckOp in APT_LicenseOperator,0> Operator completed. status: APT_StatusOk  elapsed: 0.00  user: 0.00  sys: 0.00 (total CPU: 0.00)
##I TFPM 000326 00:14:54(003) <APT_LicenseCheckOp in APT_LicenseOperator,0> Heap growth during runLocally(): 0 bytes
##I TFPM 000326 00:14:54(003) <APT_LicenseCountOp in APT_LicenseOperator,0> Heap growth during runLocally(): 0 bytes
##W TCOS 000049 00:14:54(000) <main_program> Parameter specified but not used in flow: DSHostName [osl/osl.C:650]
##W TCOS 000049 00:14:54(001) <main_program> Parameter specified but not used in flow: DSProjectName [osl/osl.C:650]
##W TCOS 000049 00:14:54(002) <main_program> Parameter specified but not used in flow: DSJobController [osl/osl.C:650]
##W TCOS 000049 00:14:54(003) <main_program> Parameter specified but not used in flow: DSJobName [osl/osl.C:650]
##W TCOS 000049 00:14:54(004) <main_program> Parameter specified but not used in flow: DSJobStartTimestamp [osl/osl.C:650]
##W TCOS 000049 00:14:54(005) <main_program> Parameter specified but not used in flow: DSJobStartDate [osl/osl.C:650]
##W TCOS 000049 00:14:54(006) <main_program> Parameter specified but not used in flow: DSJobStartTime [osl/osl.C:650]
##W TCOS 000049 00:14:54(007) <main_program> Parameter specified but not used in flow: DSJobWaveNo [osl/osl.C:650]
##W TCOS 000049 00:14:54(008) <main_program> Parameter specified but not used in flow: DSJobInvocationId [osl/osl.C:650]
##W TCOS 000049 00:14:54(009) <main_program> Parameter specified but not used in flow: DSTCPPort [osl/osl.C:650]
##W TCOS 000049 00:14:54(010) <main_program> Parameter specified but not used in flow: DSInstallTag [osl/osl.C:650]
##W TCOS 000049 00:14:54(011) <main_program> Parameter specified but not used in flow: DSProjectMapName [osl/osl.C:650]
##W TCOS 000049 00:14:54(012) <main_program> Parameter specified but not used in flow: _REJ_PATH [osl/osl.C:650]
##W TCOS 000049 00:14:54(013) <main_program> Parameter specified but not used in flow: _WRK_PATH [osl/osl.C:650]
##I TFPM 000324 00:15:09(000) <APT_LUTCreateOp in LK_CL,0> Calling runLocally: step=1, node=node3, op=3, ptn=0
##I TFPM 000324 00:15:09(000) <buffer(0),0> Calling runLocally: step=1, node=node3, op=2, ptn=0
##I TFPM 000324 00:15:09(000) <APT_CombinedOperatorController(0),0> Calling runLocally: step=1, node=node3, op=1, ptn=0
##I TFPM 000324 00:15:09(000) <APT_CombinedOperatorController(2),0> Calling runLocally: step=1, node=node3, op=4, ptn=0
##I TFPM 000324 00:15:09(000) <APT_CombinedOperatorController(1),0> Calling runLocally: step=1, node=node3, op=0, ptn=0
##I TFPM 000324 00:15:10(000) <APT_LUTCreateOp in LK_CL,1> Calling runLocally: step=1, node=node4, op=3, ptn=1
##I TFPM 000324 00:15:10(000) <APT_CombinedOperatorController(0),1> Calling runLocally: step=1, node=node4, op=1, ptn=1
##I TFPM 000324 00:15:10(000) <buffer(0),1> Calling runLocally: step=1, node=node4, op=2, ptn=1
##I TFPM 000324 00:15:10(000) <APT_CombinedOperatorController(1),1> Calling runLocally: step=1, node=node4, op=0, ptn=1
##I TFPM 000324 00:15:10(000) <APT_CombinedOperatorController(2),1> Calling runLocally: step=1, node=node4, op=4, ptn=1
##I TFPM 000324 00:15:10(000) <APT_LUTCreateOp in LK_CL,2> Calling runLocally: step=1, node=node6, op=3, ptn=2
##I TFPM 000324 00:15:10(000) <buffer(0),2> Calling runLocally: step=1, node=node6, op=2, ptn=2
##I TFPM 000324 00:15:10(000) <APT_CombinedOperatorController(0),2> Calling runLocally: step=1, node=node6, op=1, ptn=2
##I TFPM 000324 00:15:10(000) <APT_CombinedOperatorController(1),2> Calling runLocally: step=1, node=node6, op=0, ptn=2
##I TFPM 000324 00:15:10(000) <APT_CombinedOperatorController(2),2> Calling runLocally: step=1, node=node6, op=4, ptn=2
##I TOCP 000163 00:15:10(001) <DS_Input_BTFMSNCD_20,2> Input 0 consumed 25 records.
##I TOCP 000094 00:15:10(002) <DS_Input_BTFMSNCD_20,2> Output 0 produced 25 records.
##I TOCP 000163 00:15:10(001) <DS_Input_BTFMSNCD_20,1> Input 0 consumed 28 records.
##I TOCP 000094 00:15:10(002) <DS_Input_BTFMSNCD_20,1> Output 0 produced 28 records.
##I TFOP 000163 00:15:10(003) <APT_TransformOperatorImplV0S89_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_DUMMY in TR_BTFMSNCD_DUMMY,1> Input 0 consumed 28 records.
##I TFOP 000094 00:15:10(004) <APT_TransformOperatorImplV0S89_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_DUMMY in TR_BTFMSNCD_DUMMY,1> Output 0 produced 28 records.
##I TFPM 000325 00:15:10(005) <APT_CombinedOperatorController(0),1> Operator completed. status: APT_StatusOk  elapsed: 0.01  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:15:10(006) <APT_CombinedOperatorController(0),1> Heap growth during runLocally(): 0 bytes
##I TFOP 000163 00:15:10(003) <APT_TransformOperatorImplV0S89_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_DUMMY in TR_BTFMSNCD_DUMMY,2> Input 0 consumed 25 records.
##I TFOP 000094 00:15:10(004) <APT_TransformOperatorImplV0S89_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_DUMMY in TR_BTFMSNCD_DUMMY,2> Output 0 produced 25 records.
##I TFPM 000325 00:15:10(005) <APT_CombinedOperatorController(0),2> Operator completed. status: APT_StatusOk  elapsed: 0.01  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:15:10(006) <APT_CombinedOperatorController(0),2> Heap growth during runLocally(): 0 bytes
##I TOCP 000163 00:15:10(000) <DS_Input_BTFMSNCD_20,0> Input 0 consumed 26 records.
##I TOCP 000094 00:15:10(001) <DS_Input_BTFMSNCD_20,0> Output 0 produced 26 records.
##I TFOP 000163 00:15:10(002) <APT_TransformOperatorImplV0S89_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_DUMMY in TR_BTFMSNCD_DUMMY,0> Input 0 consumed 26 records.
##I TFOP 000094 00:15:10(003) <APT_TransformOperatorImplV0S89_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_DUMMY in TR_BTFMSNCD_DUMMY,0> Output 0 produced 26 records.
##I TFPM 000325 00:15:10(004) <APT_CombinedOperatorController(0),0> Operator completed. status: APT_StatusOk  elapsed: 0.02  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:15:10(005) <APT_CombinedOperatorController(0),0> Heap growth during runLocally(): 0 bytes
##I TOCP 000163 00:15:10(001) <DS_VAL_X_CL,1> Input 0 consumed 9997 records.
##I TOCP 000094 00:15:10(002) <DS_VAL_X_CL,1> Output 0 produced 9997 records.
##I TFLT 000163 00:15:10(003) <FLT_VAL_X_CL01,1> Input 0 consumed 9997 records.
##I TFLT 000094 00:15:10(004) <FLT_VAL_X_CL01,1> Output 0 produced 1 records.
##I TFPM 000325 00:15:10(005) <APT_CombinedOperatorController(1),1> Operator completed. status: APT_StatusOk  elapsed: 0.03  user: 0.02  sys: 0.01 (total CPU: 0.03)
##I TFPM 000326 00:15:10(006) <APT_CombinedOperatorController(1),1> Heap growth during runLocally(): 0 bytes
##I TOCP 000163 00:15:10(000) <DS_VAL_X_CL,0> Input 0 consumed 9998 records.
##I TOCP 000094 00:15:10(001) <DS_VAL_X_CL,0> Output 0 produced 9998 records.
##I TFLT 000163 00:15:10(002) <FLT_VAL_X_CL01,0> Input 0 consumed 9998 records.
##I TFLT 000094 00:15:10(003) <FLT_VAL_X_CL01,0> Output 0 produced 1 records.
##I TFPM 000325 00:15:10(004) <APT_CombinedOperatorController(1),0> Operator completed. status: APT_StatusOk  elapsed: 0.04  user: 0.02  sys: 0.01 (total CPU: 0.03)
##I TOCP 000163 00:15:10(001) <DS_VAL_X_CL,2> Input 0 consumed 9997 records.
##I TOCP 000094 00:15:10(002) <DS_VAL_X_CL,2> Output 0 produced 9997 records.
##I TFPM 000326 00:15:10(005) <APT_CombinedOperatorController(1),0> Heap growth during runLocally(): 0 bytes
##I TFPM 000325 00:15:10(000) <buffer(0),0> Operator completed. status: APT_StatusOk  elapsed: 0.05  user: 0.00  sys: 0.00 (total CPU: 0.00)
##I TFPM 000326 00:15:10(001) <buffer(0),0> Heap growth during runLocally(): 0 bytes
##I TOLP 000163 00:15:10(000) <APT_LUTProcessOp in LK_CL,0> Input 0 consumed 26 records.
##I TOLP 000163 00:15:10(001) <APT_LUTProcessOp in LK_CL,0> Input 1 consumed 2 records.
##I TOLP 000163 00:15:10(002) <APT_LUTProcessOp in LK_CL,0> Input 2 consumed 0 records.
##I TOLP 000094 00:15:10(003) <APT_LUTProcessOp in LK_CL,0> Output 0 produced 26 records.
##I TOLC 000163 00:15:10(000) <APT_LUTCreateOp in LK_CL,0> Input 0 consumed 2 records.
##I TOLC 000094 00:15:10(001) <APT_LUTCreateOp in LK_CL,0> Output 0 produced 1 records.
##I TOLC 000094 00:15:10(002) <APT_LUTCreateOp in LK_CL,0> Output 1 produced 0 records.
##I TFPM 000325 00:15:10(001) <buffer(0),1> Operator completed. status: APT_StatusOk  elapsed: 0.05  user: 0.00  sys: 0.00 (total CPU: 0.00)
##I TFPM 000326 00:15:10(002) <buffer(0),1> Heap growth during runLocally(): 0 bytes
##I TOLP 000163 00:15:10(001) <APT_LUTProcessOp in LK_CL,1> Input 0 consumed 28 records.
##I TOLP 000163 00:15:10(002) <APT_LUTProcessOp in LK_CL,1> Input 1 consumed 3 records.
##I TFLT 000163 00:15:10(003) <FLT_VAL_X_CL01,2> Input 0 consumed 9997 records.
##I TFLT 000094 00:15:10(004) <FLT_VAL_X_CL01,2> Output 0 produced 0 records.
##I TFPM 000325 00:15:10(005) <APT_CombinedOperatorController(1),2> Operator completed. status: APT_StatusOk  elapsed: 0.05  user: 0.02  sys: 0.01 (total CPU: 0.03)
##I TFPM 000325 00:15:10(003) <APT_LUTCreateOp in LK_CL,0> Operator completed. status: APT_StatusOk  elapsed: 0.05  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFOP 000163 00:15:10(004) <APT_TransformOperatorImplV0S91_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_20 in TR_BTFMSNCD_20,0> Input 0 consumed 26 records.
##I TFOP 000094 00:15:10(005) <APT_TransformOperatorImplV0S91_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_20 in TR_BTFMSNCD_20,0> Output 0 produced 26 records.
##I TFPM 000326 00:15:10(004) <APT_LUTCreateOp in LK_CL,0> Heap growth during runLocally(): 196624 bytes
##I TFPM 000325 00:15:10(006) <APT_CombinedOperatorController(2),0> Operator completed. status: APT_StatusOk  elapsed: 0.05  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:15:10(007) <APT_CombinedOperatorController(2),0> Heap growth during runLocally(): 0 bytes
##I TOLC 000163 00:15:10(001) <APT_LUTCreateOp in LK_CL,1> Input 0 consumed 2 records.
##I TOLC 000094 00:15:10(002) <APT_LUTCreateOp in LK_CL,1> Output 0 produced 1 records.
##I TOLC 000094 00:15:10(003) <APT_LUTCreateOp in LK_CL,1> Output 1 produced 0 records.
##I TOLP 000163 00:15:10(003) <APT_LUTProcessOp in LK_CL,1> Input 2 consumed 0 records.
##I TOLP 000094 00:15:10(004) <APT_LUTProcessOp in LK_CL,1> Output 0 produced 28 records.
##I TFPM 000325 00:15:10(004) <APT_LUTCreateOp in LK_CL,1> Operator completed. status: APT_StatusOk  elapsed: 0.05  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFOP 000163 00:15:10(005) <APT_TransformOperatorImplV0S91_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_20 in TR_BTFMSNCD_20,1> Input 0 consumed 28 records.
##I TFOP 000094 00:15:10(006) <APT_TransformOperatorImplV0S91_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_20 in TR_BTFMSNCD_20,1> Output 0 produced 28 records.
##I TFPM 000326 00:15:10(005) <APT_LUTCreateOp in LK_CL,1> Heap growth during runLocally(): 196624 bytes
##I TFPM 000325 00:15:10(007) <APT_CombinedOperatorController(2),1> Operator completed. status: APT_StatusOk  elapsed: 0.04  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:15:10(008) <APT_CombinedOperatorController(2),1> Heap growth during runLocally(): 0 bytes
##I TFPM 000325 00:15:10(001) <buffer(0),2> Operator completed. status: APT_StatusOk  elapsed: 0.05  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:15:10(006) <APT_CombinedOperatorController(1),2> Heap growth during runLocally(): 0 bytes
##I TOLC 000163 00:15:10(001) <APT_LUTCreateOp in LK_CL,2> Input 0 consumed 2 records.
##I TOLC 000094 00:15:10(002) <APT_LUTCreateOp in LK_CL,2> Output 0 produced 1 records.
##I TOLC 000094 00:15:10(003) <APT_LUTCreateOp in LK_CL,2> Output 1 produced 0 records.
##I TFPM 000325 00:15:10(004) <APT_LUTCreateOp in LK_CL,2> Operator completed. status: APT_StatusOk  elapsed: 0.06  user: 0.01  sys: 0.00 (total CPU: 0.01)
##I TFPM 000326 00:15:10(005) <APT_LUTCreateOp in LK_CL,2> Heap growth during runLocally(): 0 bytes
##I TFPM 000326 00:15:10(002) <buffer(0),2> Heap growth during runLocally(): 0 bytes
##I TOLP 000163 00:15:10(001) <APT_LUTProcessOp in LK_CL,2> Input 0 consumed 25 records.
##I TOLP 000163 00:15:10(002) <APT_LUTProcessOp in LK_CL,2> Input 1 consumed 1 records.
##I TOLP 000163 00:15:10(003) <APT_LUTProcessOp in LK_CL,2> Input 2 consumed 0 records.
##I TOLP 000094 00:15:10(004) <APT_LUTProcessOp in LK_CL,2> Output 0 produced 25 records.
##I TFOP 000163 00:15:10(005) <APT_TransformOperatorImplV0S91_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_20 in TR_BTFMSNCD_20,2> Input 0 consumed 25 records.
##I TFOP 000094 00:15:10(006) <APT_TransformOperatorImplV0S91_DS_CORE_BTFMSNCD_20_TR_BTFMSNCD_20 in TR_BTFMSNCD_20,2> Output 0 produced 25 records.
##I TFPM 000325 00:15:10(007) <APT_CombinedOperatorController(2),2> Operator completed. status: APT_StatusOk  elapsed: 0.06  user: 0.02  sys: 0.00 (total CPU: 0.02)
##I TFPM 000326 00:15:10(008) <APT_CombinedOperatorController(2),2> Heap growth during runLocally(): 0 bytes
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There are no errors in what you have posted. Whence does it come? What has your official support provider had to say? Has the operating system changed along with the hardware upgrade and, if so, are all required patches in place for the new operating system version?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
peterxu
Participant
Posts: 8
Joined: Fri Sep 22, 2006 2:10 am

Post by peterxu »

Hi Ray, thanks for the quick response. Yes, there are no errors. It just hanged there with no reponse and, actually, it's already completed. That's why I feel it weird. And no OS configuration change.
Post Reply