
Jobs aborted with "Write to dataset failed"

Posted: Tue Dec 08, 2009 9:38 am
by Flyerman_2
Hi,

Datastage 8.0.1
OS: AIX 5.3.0.0

When trying to write to a dataset, I'm getting the following errors:

########################
FATAL :
########################
APT_CombinedOperatorController(7),4: Write to dataset on [fd 17] failed (Error 0) on node node5, hostname <Server name>
APT_CombinedOperatorController(7),4: Orchestrate was unable to write to any of the following files:
APT_CombinedOperatorController(7),4: /DataStage/data/<filename>
APT_CombinedOperatorController(7),0: Write to dataset on [fd 17] failed (Error 0) on node node1, hostname <Server name>
APT_CombinedOperatorController(7),0: Orchestrate was unable to write to any of the following files:
APT_CombinedOperatorController(7),0: /DataStage/data/<filename>
APT_CombinedOperatorController(7),4: Block write failure. Partition: 4
<Filename>,4: Failure during execution of operator logic.
APT_CombinedOperatorController(7),4: Fatal Error: File data set, file "/DataStage/data/<Filename>.ds".; output of "<Filename>": DM getOutputRecord error.
APT_CombinedOperatorController(7),0: Block write failure. Partition: 0
<Filename>,0: Failure during execution of operator logic.
APT_CombinedOperatorController(7),0: Fatal Error: File data set, file "/DataStage/data/<Filename>.ds".; output of "<Filename>": DM getOutputRecord error.
node_node1: Player 67 terminated unexpectedly.
node_node5: Player 64 terminated unexpectedly.
main_program: APT_PMsectionLeader(1, node1), player 67 - Unexpected exit status 1.
<Filename 2>,0: Failure during execution of operator logic.
<Filename 2>,0: Fatal Error: Unable to allocate communication resources
node_node1: Player 42 terminated unexpectedly.
main_program: APT_PMsectionLeader(1, node1), player 42 - Unexpected exit status 1. (...)
<Filename 2>,4: Failure during execution of operator logic.
<Filename 2>,4: Fatal Error: Unable to allocate communication resources
main_program: Step execution finished with status = FAILED.
########################

Failed to execute job :<Job Name>. Return Code : 16

In the same log, we also see:
Message:: main_program: The open files limit is 2000; raising to 2147483647.
I do not know whether this is normal.

Another log in /DataStage/MetaData/<project_name>/&PH&/ gives:
"DataStage Job 1035 Phantom 20950
readSocket() returned 16
DataStage Phantom Finished."
The settings are unchanged, and we have the necessary Unix permissions on the directories.


We now have this problem on 3 servers (2 of them Production), always with the same error message and always in an old job.

We found that replacing a Join stage with a Lookup made that job work fine, but the issue then appeared in the next job. :(
All these jobs had worked for a long time. We have too many jobs, and a Lookup cannot always be substituted for a Join.

We looked at the ulimit parameters.

We have the same values on all 3 servers.

From the Unix shell:
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 4194304
memory(kbytes) unlimited
coredump(blocks) 2097151
nofiles(descriptors) unlimited

but from sh -c "ulimit -a" run via the DataStage Administrator:
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 1572864
stack(kbytes) 4194304
memory(kbytes) unlimited
coredump(blocks) 0
nofiles(descriptors) unlimited

Note that there are 2 differences between the 2 outputs (data and coredump). I do not know why.
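As far as I understand, a PX job inherits its limits from the environment that started the engine (typically dsrpcd), not from an interactive login shell, so changes in dsenv only take effect after the engine is restarted. A minimal sketch for comparing the two views (the set of flags checked here is my own choice, and AIX's ulimit may label or support them differently):

```shell
# Compare key ulimits between the current shell and a fresh child shell,
# which approximates what a spawned process inherits from its parent.
cmp_limits() {
    for flag in d f n c; do        # data, file, nofiles, coredump
        here=$(ulimit -$flag)
        child=$(sh -c "ulimit -$flag")
        if [ "$here" = "$child" ]; then
            echo "$flag: shell=$here sh-c=$child"
        else
            echo "$flag: shell=$here sh-c=$child  <-- differs"
        fi
    done
}
cmp_limits
```

If the child-shell values differ from the login-shell ones, the discrepancy usually comes from whatever profile or per-user limits (e.g. /etc/security/limits on AIX) apply to the account the engine runs under.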

For information, months ago we added the following to the dsenv script:
ulimit -d unlimited
ulimit -m unlimited
# ulimit -s unlimited
ulimit -f unlimited

Nothing else in DSPARAMS.

--------

This is the <Server name>.apt file:

{
node "node1"
{
fastname "<Server name>"
pools ""
resource disk "/DataStage/data/PX1/<project name>/DS" {pools ""}
resource scratchdisk "/DataStage/data/PX1/<project name>/SCRATCH" {pools ""}
}
... node "node6"
{
fastname "<Server name>"
pools ""
resource disk "/DataStage/data/PX6/<project name>/DS" {pools ""}
resource scratchdisk "/DataStage/data/PX6/<project name>/SCRATCH" {pools ""}
}
}
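As a sanity check against a configuration like the one above, the resource disk and scratchdisk paths can be pulled out of the .apt file and the free space on each reported. This is only a sketch: the embedded sample config and its paths are stand-ins, not the real <Server name>.apt.

```shell
# Extract every "resource disk" / "resource scratchdisk" path from an APT
# configuration file and report the free space on each filesystem.
check_disks() {
    awk -F'"' '/resource (disk|scratchdisk)/ {print $2}' "$1" |
    sort -u |
    while read -r dir; do
        df -k "$dir" | tail -1
    done
}

# Illustrative sample config; point CONF at the real .apt file instead.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
{
  node "node1"
  {
    fastname "myhost"
    pools ""
    resource disk "/tmp" {pools ""}
    resource scratchdisk "/tmp" {pools ""}
  }
}
EOF
check_disks "$CONF"
```

Running this on each of the 6 filesystems just before and during the job run would confirm the "no significant change" observation from the job's own point of view.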

We have enough disk space; we monitored the file systems while the job ran and saw no significant change.

We have 6 file systems, one per node, each with more than 30 GB free.

We also checked the tmp directory: no disk space problem there.


Do you have any ideas?

Thanks for your help.

Posted: Tue Dec 08, 2009 9:43 am
by chulett
Seems to me that a "Block write failure" is either because the disk is full or you have a media error / bad block / hardware issue. You monitored the space while the jobs ran and the error was generated?

Also, is your O/S 32bit or 64bit?
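For what it's worth, both checks can be scripted. A rough sketch (the filesystem list is a placeholder, and the errpt step is AIX-specific, so it is simply skipped on other systems):

```shell
# Report fullness of the filesystems holding the dataset files, then,
# on AIX only, list recent hardware-class errors (media / bad-block issues).
for fs in /tmp; do                  # replace with the real dataset filesystems
    df -k "$fs" | tail -1
done
if command -v errpt >/dev/null 2>&1; then
    errpt -d H | head -20           # hardware error class; AIX only
fi
```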

Posted: Wed Dec 09, 2009 3:51 am
by Flyerman_2
First, thank you for your help.

O/S is 64bit.

Yes, I monitored the space while the jobs ran and the error was generated: nothing significant. The job failed after a little more than 1 minute.

What is strange is that we have the same problem at the same moment on 3 different servers in different locations.

Some days before, we updated our backup-and-restore script on all the servers; we just added the STOP and START for the ASB Node, as described in the "InfoSphere Information Server Administration Guide".
The return code is 0, so it seems OK.
And I do not see why this update could generate this error.

Posted: Wed Dec 09, 2009 7:59 am
by chulett
Interesting that I'm thanked for my help and yet my attempt to help is rated as 'off-topic/superfluous'. Nice. :?

I asked re: the 'bitness' of your O/S because I've seen issues like this in a 32bit environment that did not occur in a 64bit one. I can't imagine any changes to your backup script would generate this error unless someone decided to run/test it while jobs were running. Speaking of which, what the heck does this mean?

"What it is strange, is we have the same problem at the same moment in 3 differents servers not in the same place."

Three different servers not in the same place? Are you saying this happened simultaneously on three different physical pieces of hardware? :shock:

Posted: Sun Dec 13, 2009 12:00 pm
by sjfearnside
I am experiencing this problem now. Did you solve it? If so, what was the solution?

Posted: Sun Dec 13, 2009 3:18 pm
by chulett
Which problem, exactly? The block write failure? The same problem at the same moment on 3 different servers?

Posted: Sun Dec 13, 2009 4:52 pm
by sjfearnside
Write to dataset on [fd 17] failed (Error 0) on node node5, hostname <Server name>

Posted: Wed May 04, 2011 10:09 am
by Nagaraj
Any other ideas to get around this block write failure?

Posted: Wed May 04, 2011 10:38 am
by chulett
You should start your own post if you are having a similar problem.

Posted: Wed May 04, 2011 5:57 pm
by Nagaraj
chulett wrote:You should start your own post if you are having a similar problem.
I just thought that since this thread is still open I would continue it and mark it as resolved or as a workaround.

:)

Posted: Wed May 04, 2011 7:31 pm
by chulett
That's the problem - you can't. It's not your thread.

Posted: Mon May 09, 2011 10:35 am
by fridge
It may be worth checking whether the dataset got written at all - check the size of the dataset segment files in the .../Datasets directory.

The reason I say this is that we hit a problem some years ago where our dataset segments were throwing up a similar error, and after checking disk space and file limits for the user, I checked the sizes: each segment was failing at 512 bytes short of 1 GB.

The issue was actually to do with the PX setup - I can't, I'm afraid, remember the exact details - but it was to do with the memory model. It was explained to me that the executables have x amount of memory to address, and this can be configured as y bytes for data, z bytes for 'code', and so on. (To be honest, my sysadmin got halfway through this and I dozed off, but you get the idea.) It was a simple Unix command to change the configuration; the command syntax was supplied by Ascential (pre-IBM).

I know the above isn't a solution (if you haven't solved it already) - but if you check the sizes as suggested and see a similar symptom (512 bytes short of 1 GB), I will try to dig out my notes.
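If it helps, scanning for segments stuck just under a size boundary can be scripted. A sketch along those lines - the directory and the two demo files are fabricated stand-ins; in practice DIR would be the resource disk / Datasets directory on each node:

```shell
# Flag files whose size sits suspiciously close to (but under) a 1 GB or
# 2 GB boundary, e.g. 512 bytes short of 1 GB as described above.
near_limit() {
    find "$1" -type f | while read -r f; do
        sz=$(wc -c < "$f")
        for lim in 1073741824 2147483648; do
            gap=$((lim - sz))
            if [ "$gap" -ge 0 ] && [ "$gap" -le 4096 ]; then
                echo "$f: $sz bytes ($gap short of $lim)"
            fi
        done
    done
}

# Demo directory with fake segment files: one 512 bytes short of 1 GB
# (created sparse, so it costs almost no disk), one small.
DIR=$(mktemp -d)
dd if=/dev/zero of="$DIR/seg1" bs=1 count=1 seek=$((1073741824 - 513)) 2>/dev/null
dd if=/dev/zero of="$DIR/seg2" bs=1024 count=4 2>/dev/null
near_limit "$DIR"
```

Only seg1 should be reported; any real segment flagged this way would match the symptom described above.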


Posted: Tue Oct 25, 2016 1:51 am
by ulab
This issue got resolved after changing the configuration file (config.apt).