sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 consp

hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 consp

Post by hexaware_tmk »

Hi,

We have a parallel job that fails with the error message below. We do not know whether it fails because of bad design or some resource contention.

Can anyone share some thoughts, please?

It runs daily, but it fails only sometimes.


Error Log:

Error Timestamp: 2014-10-24 10:08:33
Error Job Name: J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK

Error Job Path: \Jobs\PowerSTEPP\ELIGIBILITY\CATAMARAN_HLTRANS834_BLK\EXTRACT
Error Message: Unhandled abort encountered in job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK

Job Log is as below:
7165\2014-10-24 10:08:15\1\\376\From previous run (...)
7166\2014-10-24 10:08:15\5\\377\Starting Job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK. (...)
7167\2014-10-24 10:08:15\1\\377\Attached Message Handlers: (...)
7168\2014-10-24 10:08:16\1\\377\Environment variable settings: (...)
7169\2014-10-24 10:08:16\1\\377\Parallel job initiated
7170\2014-10-24 10:08:16\1\\377\OSH script (...)
7171\2014-10-24 10:08:18\1\\377\main_program: IBM WebSphere DataStage Enterprise Edition 8.5.0.6152 (...)
7172\2014-10-24 10:08:18\1\\377\main_program: conductor uname: -s=Windows_NT; -r=1; -v=6; -n=LKF-PISRVENG01; -m=Pentium
7173\2014-10-24 10:08:18\1\\377\main_program: orchgeneral: loaded (...)
7174\2014-10-24 10:08:20\1\\377\main_program: APT configuration file: D:/IBM/InformationServer/Server/Configurations/Node2.apt (...)
7175\2014-10-24 10:08:25\1\\377\main_program: This step has 23 datasets: (...)
7176\2014-10-24 10:08:25\3\\377\APT_CombinedOperatorController,1: Fatal Error: Caught unknown exception in player process: terminating.
7177\2014-10-24 10:08:25\3\\377\node_node2: Player 7 terminated unexpectedly.
7178\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 7 - Unexpected exit status 1.
7179\2014-10-24 10:08:25\3\\377\node_node2: Player 4 terminated unexpectedly.
7180\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 4 - Unexpected exit status 1.
7181\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7182\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Write to dataset on [fd 16] failed (Error 0) on node node2, hostname LKF-PISRVENG01
7183\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Block write failure. Partition: 1
7184\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7185\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Write to dataset on [fd 16] failed (Error 0) on node node2, hostname LKF-PISRVENG01
7186\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Block write failure. Partition: 1
7187\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Internal Error: (shbuf): iomgr\iomgr.C: 1901
7188\2014-10-24 10:08:25\3\\377\node_node2: Player 3 terminated unexpectedly.
7189\2014-10-24 10:08:25\3\\377\main_program: APT_PMsectionLeader(2, node2), player 3 - Unexpected exit status 1.
7190\2014-10-24 10:08:25\3\\377\LKP_MEMBER,0: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe
7191\2014-10-24 10:08:25\3\\377\LKP_MEMBER,0: Write to dataset on [fd 19] failed (Error 0) on node node1, hostname LKF-PISRVENG01
7192\2014-10-24 10:08:30\3\\377\LKP_MEMBER,0: Block write failure. Partition: 1
7193\2014-10-24 10:08:30\3\\377\main_program: Step execution finished with status = FAILED.
7194\2014-10-24 10:08:30\1\\377\main_program: Startup time, 0:12; production run time, 0:00.
7195\2014-10-24 10:08:30\1\\377\Contents of phantom output file (...)
7196\2014-10-24 10:08:31\5\\377\Job J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK aborted.
7197\2014-10-24 10:08:31\7\\377\(SEQ_J_NS_CATAMARAN_HLTRANS834_BLK) <- J_NS_CATAMARAN_HLTRANS834_EMP_LOAD_BLK: Job under control finished.
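
A log like the one above can be summarised by grouping the error entries by the process that reported them. The following is only a minimal sketch: it assumes the exported log is backslash-delimited as shown (event id, timestamp, type code, an empty field, a fifth field whose meaning is not assumed here, then the message), and that type code 3 marks the error entries, which is how they appear in this log. The function names are hypothetical.

```python
from collections import Counter


def parse_log_line(line):
    """Split one exported log line into (event_id, timestamp, type_code, message).

    Only the first five backslashes are treated as delimiters, so a message
    that itself contains a backslash (e.g. iomgr\\iomgr.C) stays intact.
    """
    parts = line.split("\\", 5)
    if len(parts) < 6:
        return None  # continuation line or unexpected format
    event_id, timestamp, type_code, _empty, _field5, message = parts
    return event_id, timestamp, type_code, message


def errors_by_source(lines, error_type="3"):
    """Count entries of the given type code per reporting source.

    The source is the text before the first colon of the message,
    for example 'LKP_MEMBER,1' or 'main_program'.
    """
    counts = Counter()
    for line in lines:
        parsed = parse_log_line(line.strip())
        if parsed is None:
            continue
        _, _, type_code, message = parsed
        if type_code == error_type:
            counts[message.split(":", 1)[0]] += 1
    return counts


# A few lines copied from the log above.
SAMPLE = [
    r"7175\2014-10-24 10:08:25\1\\377\main_program: This step has 23 datasets: (...)",
    r"7176\2014-10-24 10:08:25\3\\377\APT_CombinedOperatorController,1: Fatal Error: Caught unknown exception in player process: terminating.",
    r"7181\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe",
    r"7182\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Write to dataset on [fd 16] failed (Error 0) on node node2, hostname LKF-PISRVENG01",
    r"7187\2014-10-24 10:08:25\3\\377\LKP_MEMBER,1: Internal Error: (shbuf): iomgr\iomgr.C: 1901",
    r"7190\2014-10-24 10:08:25\3\\377\LKP_MEMBER,0: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 conspart = 1 Broken pipe",
]

if __name__ == "__main__":
    for source, count in errors_by_source(SAMPLE).most_common():
        print(source, count)
```

Run against the full log, this makes it easier to see that the first error comes from a combined operator on partition 1, with the LKP_MEMBER write failures following it.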
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

As a first guess with all those write failures - you ran out of space.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I would concur that running out of disk space would be the first suspect.

Checking your disk space after a job abort might be too late.

Since it occurs only sometimes, you will need to monitor your disk usage continuously. Be sure to monitor resourcedisk, scratchdisk and temp space.

Mike
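
Continuous monitoring of this kind could be sketched as below. This is an illustration only, not a DataStage facility: it polls the file systems behind a list of directories using Python's standard `shutil.disk_usage` and records the peak usage seen. The function names, the 90% threshold, and the polling interval are assumptions; pass in your own resource disk, scratch disk, and temp directories.

```python
import shutil
import sys
import time


def usage_percent(path):
    """Return used space as a percentage of the file system holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def sample_paths(paths):
    """Take one reading per path; unreadable or missing paths map to None."""
    readings = {}
    for path in paths:
        try:
            readings[path] = round(usage_percent(path), 2)
        except OSError:
            readings[path] = None
    return readings


def monitor(paths, interval_seconds=5, samples=3, threshold=90.0):
    """Poll the paths `samples` times and return the peak usage seen per path."""
    peaks = {path: 0.0 for path in paths}
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for path, pct in sample_paths(paths).items():
            if pct is None:
                continue
            peaks[path] = max(peaks[path], pct)
            if pct >= threshold:
                print(f"{stamp} WARNING {path} at {pct}% used")
        if i < samples - 1:
            time.sleep(interval_seconds)
    return peaks


if __name__ == "__main__":
    # Directories to watch are given on the command line.
    print(monitor(sys.argv[1:] or ["."]))
```

Because the failure is intermittent, the interval should be short compared with the job's run time and the peaks should be logged, so that a brief spike during the run is not missed.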
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What's on node LKF-PISRVENG01 ?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Re: sendWriteSignal() failed on node LKF-PISRVENG01 ds = 7 c

Post by hexaware_tmk »

We have a 2-node configuration file in production:

{
	node "node1"
	{
		fastname "LKF-PISRVENG01"
		pools ""
		resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
	}
	node "node2"
	{
		fastname "LKF-PISRVENG01"
		pools ""
		resource disk "D:/IBM/InformationServer/Server/Datasets1" {pools ""}
		resource scratchdisk "D:/IBM/InformationServer/Server/Scratch1" {pools ""}
	}
}
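
To check every directory the configuration refers to, the resource paths can be pulled out of the file mechanically. The following is a rough sketch that assumes the simple one-entry-per-line layout shown above; it is a regex scan, not a full parser for the configuration file format, and the function name is hypothetical.

```python
import re

# Matches lines such as:  node "node1"
NODE_RE = re.compile(r'^\s*node\s+"([^"]+)"')
# Matches lines such as:  resource scratchdisk "D:/path" {pools ""}
RESOURCE_RE = re.compile(r'^\s*resource\s+(disk|scratchdisk)\s+"([^"]+)"')


def resources_by_node(config_text):
    """Return {node_name: [(resource_kind, path), ...]} from the config text."""
    result = {}
    current = None
    for line in config_text.splitlines():
        node_match = NODE_RE.match(line)
        if node_match:
            current = node_match.group(1)
            result[current] = []
            continue
        res_match = RESOURCE_RE.match(line)
        if res_match and current is not None:
            result[current].append((res_match.group(1), res_match.group(2)))
    return result


# The configuration posted above.
CONFIG_TEXT = '''
{
	node "node1"
	{
		fastname "LKF-PISRVENG01"
		pools ""
		resource disk "D:/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "D:/IBM/InformationServer/Server/Scratch" {pools ""}
	}
	node "node2"
	{
		fastname "LKF-PISRVENG01"
		pools ""
		resource disk "D:/IBM/InformationServer/Server/Datasets1" {pools ""}
		resource scratchdisk "D:/IBM/InformationServer/Server/Scratch1" {pools ""}
	}
}
'''

if __name__ == "__main__":
    for node, resources in resources_by_node(CONFIG_TEXT).items():
        for kind, path in resources:
            print(node, kind, path)
```

All four paths are on the D: drive of the same host, so both partitions share one pool of free space.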
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Okay... and?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

So both Data Sets and scratch disk for both partitions are on that machine. How full are the file systems now, and when the job runs?
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

Only about 50% of the space is used:

TotalSpace  Used    Available  Capacity
250         125.31  124.69     51
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You need to also check while the job runs.
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

The 50% usage is the maximum space utilization while the job ran. We monitored the space usage for one full day, and it ranged between 40% and 50%.

Are there any other parameters or checks we should look at? This is happening randomly and we are not sure of the reason.

Since the server runs Windows, could there be a communication problem between DataStage and the Windows server?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You could try disabling operator combination so that, when next the error occurs, you'll have a more precise idea of which operator actually threw the error.

It's almost certainly related to running out of some kind of resource, or a timeout waiting for some resource (perhaps an answer from an external resource).
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

What is the value of TMPDIR?

How is that mount looking?
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

ray.wurlod wrote:You could try disabling operator combination so that, when next the error occurs, you'll have a more precise idea of which operator actually threw the error.

It's almost certainly related to running out of some kind of resource, or a timeout waiting for some resource (perhaps an answer from an external resource).

Can we set it for the whole project? We have 90 processes running in production daily, and this error occurs randomly, about once every week or two, in some jobs.

We don't have any job that fails daily. So can we set it project-wide? Will there be any effect on the normal performance of a job?
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

PaulVL wrote:What is the value of TMPDIR?

How is that mount looking?

TMPDIR is also on the same drive, so it should also fall within that 50% usage:

D:\temp
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

hexaware_tmk wrote: Can we set it for the whole project? We have 90 processes running in production daily, and this error occurs randomly, about once every week or two, in some jobs.

We don't have any job that fails daily. So can we set it project-wide? Will there be any effect on the normal performance of a job?

Yes you can, simply by changing the value of the APT_DISABLE_COMBINATION environment variable.

The impact on normal performance will be minimal (performance may even improve). Operators will no longer be combined into single processes, so jobs may run with more processes than before.