Getting a "No space left on device" error
I am getting this error despite there being plenty of space on my box. The two volumes involved, /dspublic and /dsresource, have 64 GB and 131 GB of space. My job design is:
dataset --> standardization --> Transform1 --> Transform2 --> Merge --> Transform --> Dataset
The dataset has 10 million+ rows, but the job aborts at 8.3 million.
Any ideas?
No, I haven't. I'll try that now. The specific error message I get is:
APT_CombinedOperatorController(0),0: Unsupported close in APT_FileBufferOutput::spillToNextFile(): No space left on device.
By the way, during the run my /tmp becomes completely full, and the only way around the problem is to delete files from /tmp. Wouldn't that affect my data?
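If /tmp is filling because the parallel engine is spilling scratch data there, one common remedy is to point the scratch disk resource at a larger file system in the APT configuration file. A minimal sketch follows; the node name, fastname, and paths are hypothetical examples, not values taken from this thread:

```
{
  node "node1"
  {
    fastname "devserver1"
    pools ""
    resource disk "/dsresource/datasets" { pools "" }
    resource scratchdisk "/dspublic/scratch" { pools "" }
  }
}
```

With scratchdisk on a roomier mount, buffer spills land there instead of /tmp, so deleting files from /tmp mid-run (which can indeed corrupt an in-flight job) is no longer needed.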
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Thanks Ray for your response.
What's interesting is that I monitored the scratch, resource, and temp spaces throughout the run before the abort, and none of them showed any change in free space. The problem was being caused by the Merge stage in my job. I split the job into two jobs: the first job writes all the inputs of the merge to disk, and the second job loads these files and continues with the merge. Strangely enough, this worked.
How do I monitor specific file systems? I looked into the default config file and monitored the drives mentioned there. Since I am in a grid, should I monitor some other file systems?
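To watch specific file systems during a run, a simple approach is to poll `df -k` on each resource and scratch path on every node where the job actually executes. A sketch, assuming the mount points below are placeholders for whatever your configuration file names:

```shell
#!/bin/sh
# Poll the file systems named in the APT configuration file (the paths
# here are examples only) and print their usage. In a grid, run this on
# each compute node, not just the conductor, since no job stages
# execute on the conductor.
for fs in /tmp /var/tmp; do
    # df -kP gives POSIX-format output; the 5th field of line 2 is Use%
    usage=$(df -kP "$fs" | awk 'NR==2 {gsub("%","",$5); print $5}')
    echo "$fs ${usage}% used"
done
```

Looping this with `sleep` (or `watch df -k` where available) during the run shows which file system actually climbs before the abort.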
Since you are running in a grid environment, even if you issue a df -k command you will still only see the /dspublic and /dsresource information on the Conductor node. No job is allowed to run on the Conductor node, which is why you saw nothing change when you were monitoring the job from that machine. You need root permission to issue the qstat command from the PBS Pro directory, which will give you every piece of information about your job, queue, and server.
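For reference, a hedged sketch of the PBS Pro side (the job id is hypothetical, and the helper below is our own illustration, not a PBS tool); `qstat -f` prints full job details, and the `resources_used` attributes are the lines worth tracking over time:

```shell
#!/bin/sh
# Hypothetical helper: given `qstat -f <jobid>` output on stdin, keep
# only the resources_used lines so you can watch a job's footprint.
filter_usage() {
    grep '^ *resources_used'
}
# Typical use (requires PBS Pro client tools and a real job id):
#   qstat -f 1234.pbsserver | filter_usage
```

This tells you which execution host the job landed on, so you know which node's scratch and /tmp to check.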