
sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 consp

Posted: Wed Oct 29, 2014 10:49 am
by hexaware_tmk
Hi,

We have a parallel job that fails with the error message below. We do not know whether it fails because of bad design or because of some resource contention.

Can anyone share some thoughts, please?

It runs daily, but it fails only sometimes.


Error Log:

Error Timestamp: 2014-10-24 10:08:33
Error Job Name: J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK

Error Job Path: \Jobs\PowerSTEPP\ELIGIBILITY\CATAMARAN_HLTRANS834_BLK\EXTRACT
Error Message: Unhandled abort encountered in job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK

Job Log is as below:
7165\2014-10-24 10:08:15\1\\376\From previous run (...)
7166\2014-10-24 10:08:15\5\\377\Starting Job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK. (...)
7167\2014-10-24 10:08:15\1\\377\Attached Message Handlers: (...)
7168\2014-10-24 10:08:16\1\\377\Environment variable settings: (...)
7169\2014-10-24 10:08:16\1\\377\Parallel job initiated
7170\2014-10-24 10:08:16\1\\377\OSH script (...)
7171\2014-10-24 10:08:18\1\\377\main_program: IBM WebSphere DataStage Enterprise Edition 8.5.0.6152 (...)
7172\2014-10-24 10:08:18\1\\377\main_program: conductor uname: -s=Windows_NT; -r=1; -v=6; -n=LKF-PISRVENG01; -m=Pentium
7173\2014-10-24 10:08:18\1\\377\main_program: orchgeneral: loaded (...)
7174\2014-10-24 10:08:20\1\\377\main_program: APT configuration file: D:/IBM/InformationServer/Server/Configurations/Node2.apt (...)
7175\2014-10-24 10:08:25\1\\377\main_program: This step has 23 datasets: (...)
7176\2014-10-24 10:08:25\3\\377\APT_CombinedOperatorController,1: Fatal Error: Caught unknown exception in player process: terminating.
7177\2014-10-24 10:08:25\3\\377\node_node2: Player 7 terminated unexpectedly.
7178\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 7 - Unexpected exit status 1.
7179\2014-10-24 10:08:25\3\\377\node_node2: Player 4 terminated unexpectedly.
7180\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 4 - Unexpected exit status 1.
7181\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7182\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Write to dataset on [fd 16] failed (Error 0) on node node2, hostname LKF-PISRVENG01
7183\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Block write failure. Partition: 1
7184\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7185\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Write to dataset on [fd 16] failed (Error 0) on node node2, hostname LKF-PISRVENG01
7186\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Block write failure. Partition: 1
7187\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Internal Error: (shbuf): iomgr\iomgr.C: 1901
7188\2014-10-24 10:08:25\3\\377\node_node2: Player 3 terminated unexpectedly.
7189\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 3 - Unexpected exit status 1.
7190\2014-10-24 10:08:25\3\\377\LKP_MEMBER,0: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7191\2014-10-24 10:08:25\3\\377\LKP_MEMBER,0: Write to dataset on [fd 19] failed (Error 0) on node node1, hostname LKF-PISRVENG01
7192\2014-10-24 10:08:30\3\\377\LKP_MEMBER,0: Block write failure. Partition: 1
7193\2014-10-24 10:08:30\3\\377\main_program: Step execution finished with status = FAILED.
7194\2014-10-24 10:08:30\1\\377\main_program: Startup time, 0:12; production run time, 0:00.
7195\2014-10-24 10:08:30\1\\377\Contents of phantom output file (...)
7196\2014-10-24 10:08:31\5\\377\Job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK aborted.
7197\2014-10-24 10:08:31\7\\377\(SEQ_J_NS_CATAMARAN_HLTRANS834_BLK) <- J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK: Job under control finished.

Posted: Wed Oct 29, 2014 11:20 am
by chulett
As a first guess with all those write failures - you ran out of space.

Posted: Wed Oct 29, 2014 2:02 pm
by Mike
I would concur that running out of disk space would be the first suspect.

Checking your disk space after a job abort might be too late.

Since it occurs only sometimes, you will need to monitor your disk usage continuously. Be sure to monitor resourcedisk, scratchdisk and temp space.
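As a rough illustration of continuous monitoring, the sketch below polls disk usage at a fixed interval and records the peak per path. It is only a minimal example, not a DataStage feature: the function names, the interval, the sample count and the 90% threshold are all assumptions, and the paths you pass in should be your own resource disk, scratchdisk and temp directories.

```python
import shutil
import time
from datetime import datetime


def usage_percent(path):
    """Return the percentage in use on the file system that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def monitor(paths, interval_seconds=5, samples=3, threshold=90.0):
    """Poll disk usage, print each reading, and return the peak seen per path.

    `threshold` is an arbitrary illustrative value used only to flag readings.
    """
    peaks = {path: 0.0 for path in paths}
    for i in range(samples):
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        for path in paths:
            pct = usage_percent(path)
            peaks[path] = max(peaks[path], pct)
            flag = " <-- above threshold" if pct >= threshold else ""
            print(f"{stamp} {path} {pct:.1f}%{flag}")
        # Sleep between samples, but not after the last one.
        if i < samples - 1:
            time.sleep(interval_seconds)
    return peaks
```

Calling something like `monitor([resource_dir, scratch_dir, temp_dir], interval_seconds=5, samples=720)` in a separate process would cover about an hour around the job's schedule; the peak values matter more than any single reading taken after the abort.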

Mike

Posted: Thu Oct 30, 2014 12:38 am
by ray.wurlod
What's on node LKF-PISRVENG01 ?


Posted: Thu Oct 30, 2014 10:03 am
by hexaware_tmk
We have a 2-node configuration file in production:


{
	node "node1"
	{
		fastname "LKF-PISRVENG01"
		pools ""
		resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
	}
	node "node2"
	{
		fastname "LKF-PISRVENG01"
		pools ""
		resource disk "D:/IBM/InformationServer/Server/Datasets1" {pools ""}
		resource scratchdisk "D:/IBM/InformationServer/Server/Scratch1" {pools ""}
	}
}

Posted: Thu Oct 30, 2014 10:29 am
by chulett
Okay... and?

Posted: Thu Oct 30, 2014 6:02 pm
by ray.wurlod
So both Data Sets and scratch disk for both partitions are on that machine. How full are the file systems now, and when the job runs?

Posted: Thu Nov 06, 2014 10:10 am
by hexaware_tmk
Only 50% of space used

TotalSpace   Used     Available   Capacity
250          125.31   124.69      51

Posted: Thu Nov 06, 2014 11:53 am
by chulett
You need to also check while the job runs.

Posted: Thu Nov 06, 2014 4:40 pm
by hexaware_tmk
The 50% usage is the maximum utilization observed while the job ran. We have monitored the space usage for one full day, and it ranges between 40% and 50%.

Are there any other parameters or checks we should look at? This is happening randomly and we are not sure about the reason.

Since the server is Windows, could there be any problem in the communication between DataStage and the Windows server?

Posted: Thu Nov 06, 2014 6:31 pm
by ray.wurlod
You could try disabling operator combination so that, when next the error occurs, you'll have a more precise idea of which operator actually threw the error.

It's almost certainly related to running out of some kind of resource, or a timeout waiting for some resource (perhaps an answer from an external resource).
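Once the next failure is captured with combination disabled, grouping the fatal entries by the operator that reported them makes the culprit easier to spot. The sketch below assumes log lines in the backslash-delimited layout shown in the first post (event id, timestamp, event type, an empty field, a run number, then the message) and assumes that event type 3 marks the fatal entries, as it appears to in that log. The function names are hypothetical.

```python
from collections import Counter

FATAL = "3"  # assumption: type 3 marks fatal entries, as in the log in the first post


def parse_line(line):
    """Split one exported log line into its fields, or return None if it does not fit."""
    # Limit the split to 5 so backslashes inside the message (e.g. iomgr\iomgr.C) survive.
    parts = line.rstrip("\n").split("\\", 5)
    if len(parts) != 6 or not parts[0].isdigit():
        return None
    event_id, timestamp, event_type, _unused, run, message = parts
    return {
        "id": int(event_id),
        "timestamp": timestamp,
        "type": event_type,
        "run": run,
        "message": message,
    }


def fatal_sources(lines):
    """Count fatal entries per reporting source (the text before the first colon)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["type"] == FATAL:
            source = entry["message"].split(":", 1)[0].strip()
            counts[source] += 1
    return counts
```

Feeding it the lines of an exported log, for example `fatal_sources(open(log_path))`, returns a counter keyed by names such as `LKP_MEMBER,1`; with combination disabled, the `APT_CombinedOperatorController` entries should be replaced by the individual operator names.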

Posted: Fri Nov 07, 2014 8:46 am
by PaulVL
What is the value of TMPDIR?

How is that mount looking?

Posted: Mon Nov 10, 2014 2:25 pm
by hexaware_tmk
ray.wurlod wrote:You could try disabling operator combination so that, when next the error occurs, you'll have a more precise idea of which operator actually threw the error.

It's almost certainly related to running out of some kind of resource, or a timeout waiting for some resource (perhaps an answer from an external resource).

Can we set it for the whole project? We have 90 processes running in production daily, and this error occurs randomly, about once every 1 or 2 weeks, in some jobs.

We don't have any job that fails daily, so can we set it project-wide? Will there be any effect on the normal performance of a job?

Posted: Mon Nov 10, 2014 2:26 pm
by hexaware_tmk
PaulVL wrote:What is the value of TMPDIR?

How is that mount looking?
TMPDIR is also on the same drive, so it should also be within that 50% usage:

D:\temp

Posted: Mon Nov 10, 2014 4:01 pm
by ray.wurlod
hexaware_tmk wrote:Can we set it for the whole project? We have 90 processes running in production daily, and this error occurs randomly, about once every 1 or 2 weeks, in some jobs.

We don't have any job that fails daily, so can we set it project-wide? Will there be any effect on the normal performance of a job?
Yes you can, simply by changing the value of the APT_DISABLE_COMBINATION environment variable.

There will be minimal impact on normal performance (it may even improve), as operators will no longer be combined into single processes, so jobs may run with more processes than before.