sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 consp
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
Hi,
We have a parallel job that fails with the error message below. We do not know whether it fails because of bad design or some resource contention.
Can anyone share some thoughts, please?
It runs daily, but it fails only sometimes.
Error Log:
Error Timestamp: 2014-10-24 10:08:33
Error Job Name: J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK
Error Job Path: \Jobs\PowerSTEPP\ELIGIBILITY\CATAMARAN_HLTRANS834_BLK\EXTRACT
Error Message: Unhandled abort encountered in job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK
Job Log is as below:
7165\2014-10-24 10:08:15\1\\376\From previous run (...)
7166\2014-10-24 10:08:15\5\\377\Starting Job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK. (...)
7167\2014-10-24 10:08:15\1\\377\Attached Message Handlers: (...)
7168\2014-10-24 10:08:16\1\\377\Environment variable settings: (...)
7169\2014-10-24 10:08:16\1\\377\Parallel job initiated
7170\2014-10-24 10:08:16\1\\377\OSH script (...)
7171\2014-10-24 10:08:18\1\\377\main_program: IBM WebSphere DataStage Enterprise Edition 8.5.0.6152 (...)
7172\2014-10-24 10:08:18\1\\377\main_program: conductor uname: -s=Windows_NT; -r=1; -v=6; -n=LKF-PISRVENG01; -m=Pentium
7173\2014-10-24 10:08:18\1\\377\main_program: orchgeneral: loaded (...)
7174\2014-10-24 10:08:20\1\\377\main_program: APT configuration file: D:/IBM/InformationServer/Server/Configurations/Node2.apt (...)
7175\2014-10-24 10:08:25\1\\377\main_program: This step has 23 datasets: (...)
7176\2014-10-24 10:08:25\3\\377\APT_CombinedOperatorController,1: Fatal Error: Caught unknown exception in player process: terminating.
7177\2014-10-24 10:08:25\3\\377\node_node2: Player 7 terminated unexpectedly.
7178\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 7 - Unexpected exit status 1.
7179\2014-10-24 10:08:25\3\\377\node_node2: Player 4 terminated unexpectedly.
7180\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 4 - Unexpected exit status 1.
7181\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7182\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Write to dataset on [fd 16] failed (Error 0) on node node2, hostname LKF-PISRVENG01
7183\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Block write failure. Partition: 1
7184\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7185\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Write to dataset on [fd 16] failed (Error 0) on node node2, hostname LKF-PISRVENG01
7186\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Block write failure. Partition: 1
7187\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Internal Error: (shbuf): iomgr\iomgr.C: 1901
7188\2014-10-24 10:08:25\3\\377\node_node2: Player 3 terminated unexpectedly.
7189\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 3 - Unexpected exit status 1.
7190\2014-10-24 10:08:25\3\\377\LKP_MEMBER,0: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7191\2014-10-24 10:08:25\3\\377\LKP_MEMBER,0: Write to dataset on [fd 19] failed (Error 0) on node node1, hostname LKF-PISRVENG01
7192\2014-10-24 10:08:30\3\\377\LKP_MEMBER,0: Block write failure. Partition: 1
7193\2014-10-24 10:08:30\3\\377\main_program: Step execution finished with status = FAILED.
7194\2014-10-24 10:08:30\1\\377\main_program: Startup time, 0:12; production run time, 0:00.
7195\2014-10-24 10:08:30\1\\377\Contents of phantom output file (...)
7196\2014-10-24 10:08:31\5\\377\Job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK aborted.
7197\2014-10-24 10:08:31\7\\377\(SEQ_J_NS_CATAMARAN_HLTRANS834_BLK) <- J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK: Job under control finished.
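When triaging a log like this one, it can help to find the earliest fatal entry and to count fatal entries per reporting operator, since the later "Broken pipe" writes are likely a consequence of a player that had already died. The following is a minimal sketch, not a DataStage tool. It assumes the backslash-delimited layout seen above (event id, timestamp, severity code, an empty field, a run number, message) and assumes that severity code 3 marks fatal entries; both are inferred from this log, not taken from documentation. The names `parse_line`, `first_fatal` and `fatal_counts_by_source` are hypothetical.

```python
from collections import Counter, namedtuple

LogEntry = namedtuple("LogEntry", "event_id timestamp severity run_id message")

# Assumption: severity code 3 marks fatal entries (inferred from the log above).
FATAL = "3"


def parse_line(line):
    """Parse one exported log line; return None for lines that do not fit the layout."""
    # Split on at most 5 backslashes so backslashes inside the message survive.
    parts = line.rstrip("\n").split("\\", 5)
    if len(parts) < 6 or not parts[0].isdigit():
        return None
    event_id, timestamp, severity, _blank, run_id, message = parts
    return LogEntry(int(event_id), timestamp, severity, run_id, message)


def fatal_entries(lines):
    """Return all parsed entries whose severity equals FATAL."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.severity == FATAL]


def first_fatal(lines):
    """Return the fatal entry with the lowest event id, or None if there is none."""
    fatals = fatal_entries(lines)
    return min(fatals, key=lambda e: e.event_id) if fatals else None


def fatal_counts_by_source(lines):
    """Count fatal entries per source, taken as the text before the first colon."""
    return Counter(e.message.split(":", 1)[0] for e in fatal_entries(lines))


# A few lines copied from the log above.
SAMPLE = [
    r"7175\2014-10-24 10:08:25\1\\377\main_program: This step has 23 datasets: (...)",
    r"7176\2014-10-24 10:08:25\3\\377\APT_CombinedOperatorController,1: Fatal Error: Caught unknown exception in player process: terminating.",
    r"7181\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe",
    r"7187\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Internal Error: (shbuf): iomgr\iomgr.C: 1901",
]

if __name__ == "__main__":
    first = first_fatal(SAMPLE)
    print(first.event_id, first.message)
    print(fatal_counts_by_source(SAMPLE))
```

On the sample lines, the earliest fatal entry comes from APT_CombinedOperatorController rather than LKP_MEMBER, which suggests the combined operator process is the place to look first.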
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
Re: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 c
We have a 2-node configuration file in production:
{
    node "node1"
    {
        fastname "LKF-PISRVENG01"
        pools ""
        resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "LKF-PISRVENG01"
        pools ""
        resource disk "D:/IBM/InformationServer/Server/Datasets1" {pools ""}
        resource scratchdisk "D:/IBM/InformationServer/Server/Scratch1" {pools ""}
    }
}
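Since resource contention is one of the suspects, it may be worth noting that both logical nodes in this file point at the same host and the same drive. The following is a minimal sketch that reads a configuration file of this shape and groups the resource drives by host. It is a line-based reader written for this example only, not a full parser for the configuration file syntax, and it assumes Windows-style paths with a drive letter. The names `parse_nodes` and `drives_by_host` are hypothetical.

```python
import re
from collections import defaultdict

# The configuration file quoted above.
CONFIG = '''
{
    node "node1"
    {
        fastname "LKF-PISRVENG01"
        pools ""
        resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "LKF-PISRVENG01"
        pools ""
        resource disk "D:/IBM/InformationServer/Server/Datasets1" {pools ""}
        resource scratchdisk "D:/IBM/InformationServer/Server/Scratch1" {pools ""}
    }
}
'''


def parse_nodes(text):
    """Return {node name: {"fastname": str, "resources": [(kind, path), ...]}}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r'node\s+"([^"]+)"', line)
        if match:
            current = {"fastname": None, "resources": []}
            nodes[match.group(1)] = current
            continue
        if current is None:
            continue  # ignore anything before the first node
        match = re.match(r'fastname\s+"([^"]+)"', line)
        if match:
            current["fastname"] = match.group(1)
            continue
        match = re.match(r'resource\s+(\w+)\s+"([^"]+)"', line)
        if match:
            current["resources"].append((match.group(1), match.group(2)))
    return nodes


def drives_by_host(nodes):
    """Group drive letters by host (assumes paths of the form X:/...)."""
    drives = defaultdict(set)
    for node in nodes.values():
        for _kind, path in node["resources"]:
            drives[node["fastname"]].add(path.split(":", 1)[0].upper())
    return dict(drives)


if __name__ == "__main__":
    nodes = parse_nodes(CONFIG)
    for name, info in nodes.items():
        print(name, info["fastname"], info["resources"])
    print(drives_by_host(nodes))
```

If every node reports the same host and drive, as here, all partitions share that machine's memory and that disk's I/O, which is consistent with (though not proof of) intermittent resource pressure.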
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
The 50% figure is the maximum memory utilization while the job ran. We monitored the space usage for one full day, and it ranged between 40% and 50%.
Are there any other parameters or checks we should look at? This is happening randomly and we are not sure of the reason.
Since the server runs Windows, could there be a communication problem between DataStage and the Windows server?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You could try disabling operator combination so that, when next the error occurs, you'll have a more precise idea of which operator actually threw the error.
It's almost certainly related to running out of some kind of resource, or a timeout waiting for some resource (perhaps an answer from an external resource).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
ray.wurlod wrote: You could try disabling operator combination so that, when next the error occurs, you'll have a more precise idea of which operator actually threw the error.
It's almost certainly related to running out of some kind of resource, or a timeout waiting for some resource (perhaps an answer from an external resource).
Can we set it for the whole project? We have 90 processes running in production daily, and this error occurs randomly in some jobs about once every week or two.
We don't have any job that fails daily, so can we set it project-wide? Will there be any effect on the normal performance of a job?
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
hexaware_tmk wrote: Can we set it for the whole project? We have 90 processes running in production daily, and this error occurs randomly in some jobs about once every week or two.
We don't have any job that fails daily, so can we set it project-wide? Will there be any effect on the normal performance of a job?
Yes you can, simply by changing the value of the APT_DISABLE_COMBINATION environment variable.
The impact on normal performance will be minimal (performance may even improve). Operators will no longer be combined into single processes, so jobs may run with more processes than before.