Getting a "No space left on device" error
I am getting this error despite there being plenty of space on my box. The two volumes involved, /dspublic and /dsresource, have 64 GB and 131 GB of space. My job design is:
dataset --> standardization --> Transform1 --> Transform2 --> Merge --> Transform --> Dataset
The dataset has 10 million+ rows, but the job aborts at 8.3 million.
Any ideas?
No, I haven't. I'll try that now. The specific error message I get is:
APT_CombinedOperatorController(0),0: Unsupported close in APT_FileBufferOutput::spillToNextFile(): No space left on device.
By the way, during the run my /tmp becomes completely full, and the only way around the problem is to delete files from /tmp. Wouldn't that affect my data?
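If /tmp is filling because the parallel engine is spilling scratch data there, one common remedy is to point the scratch disk resource at a larger file system in the APT configuration file. A minimal sketch follows; the node name, fastname, and paths are hypothetical examples, not values taken from this thread:

```
{
  node "node1"
  {
    fastname "devserver1"
    pools ""
    resource disk "/dsresource/datasets" { pools "" }
    resource scratchdisk "/dspublic/scratch" { pools "" }
  }
}
```

With scratchdisk on a roomier mount, buffer spills land there instead of /tmp, so deleting files from /tmp mid-run (which can indeed corrupt an in-flight job) is no longer needed.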
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Thanks Ray for your response.
What's interesting is that I monitored the scratch, resource, and temp spaces throughout the run before the abort, and none of them showed any change in free space. The problem was being caused by the Merge stage in my job. I split the job into two jobs: the first job writes all the inputs of the merge to disk, and the second job loads these files and continues with the merge. Strangely enough, this worked.
How do I monitor specific file systems? I looked into the default config file and monitored the drives mentioned there. Since I am in a grid, should I monitor some other file systems?
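To watch specific file systems during a run, a simple approach is to poll `df -k` on each resource and scratch path on every node where the job actually executes. A sketch, assuming the mount points below are placeholders for whatever your configuration file names:

```shell
#!/bin/sh
# Poll the file systems named in the APT configuration file (the paths
# here are examples only) and print their usage. In a grid, run this on
# each compute node, not just the conductor, since no job stages
# execute on the conductor.
for fs in /tmp /var/tmp; do
    # df -kP gives POSIX-format output; the 5th field of line 2 is Use%
    usage=$(df -kP "$fs" | awk 'NR==2 {gsub("%","",$5); print $5}')
    echo "$fs ${usage}% used"
done
```

Looping this with `sleep` (or `watch df -k` where available) during the run shows which file system actually climbs before the abort.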
Since you are running in a grid environment, even if you issue a df -k command you will still only see the /dspublic and /dsresource information on the Conductor node. No job is allowed to run on the Conductor node, which is why you saw nothing change when you were monitoring the job from that machine. You need root permission to issue the qstat command from the PBS Pro directory, which will give you every piece of information about your job, queue, and server.
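For reference, a hedged sketch of the PBS Pro side (the job id is hypothetical, and the helper below is our own illustration, not a PBS tool); `qstat -f` prints full job details, and the `resources_used` attributes are the lines worth tracking over time:

```shell
#!/bin/sh
# Hypothetical helper: given `qstat -f <jobid>` output on stdin, keep
# only the resources_used lines so you can watch a job's footprint.
filter_usage() {
    grep '^ *resources_used'
}
# Typical use (requires PBS Pro client tools and a real job id):
#   qstat -f 1234.pbsserver | filter_usage
```

This tells you which execution host the job landed on, so you know which node's scratch and /tmp to check.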