I get this error after running my parallel job. The job runs for a few hours before it aborts with this error.
Is this related to hitting a memory limitation, or some other kind of resource limit?
Are there any general solutions, or is this more of a job-tuning aspect that needs to be evaluated on a case-by-case basis?
From the same Director log:
Item #: 30
Event ID: 29
Timestamp: 2005-10-10 11:45:16
Type: Fatal
User Name: aramacha
Message: contDwidLookup.lkpValuationClass,0: sendWriteSignal() failed on node admin-srv306 ds = 13 conspart = 0 Broken pipe
Item #: 31
Event ID: 30
Timestamp: 2005-10-10 11:45:16
Type: Warning
User Name: aramacha
Message: contDwidLookup.lkpValuationClass,0: Could not send close message (shared memory)
Item #: 32
Event ID: 31
Timestamp: 2005-10-10 11:45:16
Type: Fatal
User Name: aramacha
Message: contDwidLookup.lkpPlant,0: sendWriteSignal() failed on node admin-srv306 ds = 19 conspart = 0 Broken pipe
Item #: 33
Event ID: 32
Timestamp: 2005-10-10 11:45:16
Type: Warning
User Name: aramacha
Message: contDwidLookup.lkpPlant,0: Could not send close message (shared memory)
sendWriteSignal() Could not send close message
Aramachandra,
PX jobs start a lot of UNIX processes, each of which does a portion of the work. These processes communicate with each other using pipes. Each pipe has a writer and a reader; if one of the two unexpectedly terminates the connection (a close, or the job aborting), the other will report a "broken pipe". Sadly, the error message is just an effect of the real problem, which is that the other half of the connection has disappeared, and it is unlikely that that process wrote a friendly message to the log file along the lines of "I have encountered a serious error and am about to crash, but in my last dying CPU cycles I am writing my message to the log files..."
This makes it a bit more difficult to locate the cause.
In your case the job ran for a while, so the basic design is most likely sound. The most common culprit here is a lack of disk space.
If you reset your job, do you get additional messages in your log file? What kind of processing does your job do? Could one of the resources be taken away from you during the run (the DBA stops the database, remote NFS filesystems are unmounted)? Is this error reproducible, either at the same row number or after the same amount of time?
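To see the mechanism behind the log message, here is a minimal Python sketch (not DataStage code) of a pipe whose reader dies: the writer learns nothing until its next write, which fails with EPIPE, the same "Broken pipe" the Director log reports.

```python
import errno
import os

# A pipe has a write end and a read end, like the links
# between PX operator processes.
read_end, write_end = os.pipe()

# Simulate the reading process aborting: its end of the pipe closes.
os.close(read_end)

# The writer only discovers this on its next write, which fails
# with EPIPE ("Broken pipe").
error_name = None
try:
    os.write(write_end, b"another block of rows")
except BrokenPipeError as exc:
    error_name = errno.errorcode[exc.errno]
finally:
    os.close(write_end)

print(error_name)  # EPIPE
```

Note that the writer sees only the broken pipe, not the reason the reader died, which is exactly why the real cause (out of disk space, a killed process, a yanked resource) has to be hunted down separately.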
The "broken pipe" situation happened at different stages in the job and it "DOES NOT" necessarily happen consistently after some row count. The problme is there is not consistent pattern to when it will occur.
I am breaking down the parellel job into multiple small jobs and using datasets to kind of connect the jobs.
Last night i ran subset of the job and after it failed i just let i run again before going home only to see it succeed this time. The reasoning could be that during evening hours the load is relatively less on the development box. But i changed a few other parameters like the director it lands dataset etc so not really sure if that did the trick.
The admin or the DBA did not yank out any resources like you mentioned as far as i know.
Since this is a development box and as the test and prod environment are much beefier i hope this will not be a resource issue on those boxes.
Arvind