swap filling while memory is still available

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium


Post by jasper »

Hi,
I have a job that was running fine, but after increasing the number of nodes it fails. If I set the Sort stages in the job to run sequentially, it runs fine.

The UNIX admins tell me that the server reports swap-space-full errors when this job fails. However, there is still a lot of memory available.
All the parameters I see in the Administrator point to directories other than /tmp.

system: solaris 8
cpu:8
memory:32GB
swap:5GB

The config file used (16 nodes):
{
node "node1"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node01/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node01/scratch00/" {pools ""}
}
node "node2"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node02/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node02/scratch00/" {pools ""}
}
node "node3"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node03/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node03/scratch00/" {pools ""}
}
node "node4"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node04/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node04/scratch00/" {pools ""}
}
node "node5"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node05/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node05/scratch00/" {pools ""}
}
node "node6"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node06/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node06/scratch00/" {pools ""}
}
node "node7"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node07/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node07/scratch00/" {pools ""}
}
node "node8"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node08/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node08/scratch00/" {pools ""}
}
node "node9"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node09/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node09/scratch00/" {pools ""}
}
node "node10"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node10/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node10/scratch00/" {pools ""}
}
node "node11"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node11/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node11/scratch00/" {pools ""}
}
node "node12"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node12/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node12/scratch00/" {pools ""}
}
node "node13"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node13/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node13/scratch00/" {pools ""}
}
node "node14"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node14/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node14/scratch00/" {pools ""}
}
node "node15"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node15/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node15/scratch00/" {pools ""}
}
node "node16"
{
fastname "tnet121"
pools ""
resource disk "/prod/loc/dts/work/node16/disk00/" {pools ""}
resource scratchdisk "/prod/loc/dts/work/node16/scratch00/" {pools ""}
}
}


Could it be that DataStage is reserving swap space per sort process somewhere?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Jasper,

the way UNIX works is that it doesn't start swapping until the available memory is full, so I am not sure what your admin is referring to. There is a difference between taking page faults and actually swapping, so perhaps you saw paging activity and assumed the system was actively using the swap space.

If swap fills up, then you will have no more virtual memory available to allocate. Depending on your UNIX implementation, a certain amount of physical memory will show up as unused because it is reserved for the OS and for non-pageable space.

Using virtual memory and the swap file is not necessarily bad; it really depends on your page-fault rate. If that rate gets too high you will experience slowdowns, and past a certain threshold you will get thrashing, where the system spends more time shuttling processes and data between memory and disk than it spends doing useful work, like executing your program.

Finding that magical balance between CPU, I/O, physical and virtual memory, number of processes and all the system settings is more of an art than a science.

You really should have your swap space at least equal to your physical memory; I like to see at least double. There is no single "correct" size, but 5GB of swap for 32GB of main memory is certainly a "wrong" configuration. Have your admin raise it.

You didn't say how many nodes you increased from, and to, when the jobs started failing. How many nodes did you specify for your 8-CPU system?
jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

The exact message the UNIX admin gets in /var/adm/messages is:

Apr 8 01:06:02 tnet121 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 13637 (osh)

How can we check the memory reserved for the OS? We see 12GB available, which seems like a lot for the OS to be reserving.
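For what it's worth, a rough way to look at the virtual-memory accounting on Solaris is the one-line summary from `swap -s`. The sketch below parses a sample of that output; the figures in `sample` are invented for illustration, and on the real box you would simply run `swap -s` as the admin.

```shell
# Hypothetical one-line summary from Solaris `swap -s` (figures invented);
# on a real server you would run `swap -s` instead of using this sample.
sample='total: 4203424k bytes allocated + 1302528k reserved = 5505952k used, 241696k available'

# Pull out the "used" and "available" figures (in KB). Note that "reserved"
# counts swap that processes have claimed but never touched, which is why
# swap can fill up while plenty of physical memory still looks free.
used=$(printf '%s\n' "$sample" | sed 's/.*= \([0-9]*\)k used.*/\1/')
avail=$(printf '%s\n' "$sample" | sed 's/.*used, \([0-9]*\)k available.*/\1/')
echo "swap used: ${used}k, available: ${avail}k"
```

When "available" gets close to zero while `vmstat` still shows free physical memory, you are seeing exactly the reservation-only exhaustion described in this thread.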

We used to have 8 nodes but tried doubling to 16. For most jobs this is fine, since it still gives a good performance increase; the CPUs used to be idle a lot.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Jasper,

with 8 CPUs I think a 16-node configuration is a good one and should not be a problem. Since you are almost doubling your number of processes (which can be quite a few for complex jobs), the additional per-process memory is noticeable but not really significant; so you must have been close to your limit even before.
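To see how per-process reservations add up even when the memory is never touched, here is a back-of-envelope sketch. Every figure in it is assumed, not measured: the number of sort stages and the per-process reservation are purely illustrative.

```shell
# Back-of-envelope arithmetic, all figures assumed (not measured from any
# real system): each osh sort process reserves private stack/heap plus its
# sort memory, and that reservation is charged against swap whether or not
# it is ever used.
nodes=16            # partitions in the posted config file
sort_stages=3       # assumed number of sort stages in the job
mb_per_process=28   # assumed reservation per sort process (sort memory + stack)
echo "rough swap reservation: $(( nodes * sort_stages * mb_per_process )) MB"
```

Doubling the node count doubles this figure, so against only 5GB of swap (much of it already reserved by other processes) the headroom disappears quickly.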

I am not sure which platform you are running on, but I would use the output of the standard vmstat command; start it with `vmstat 15 20` (15-second samples, taken 20 times, for 5 minutes of monitoring) and watch your vital statistics as you start, run, and perhaps abort your job.
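As a concrete example of what to watch for: a quick filter over the `vmstat 15 20` output can flag samples where the page scan rate ("sr", the 12th data column on Solaris) gets high, which is a common sign of memory pressure. The two sample lines below are invented; normally you would pipe the live `vmstat 15 20` output into the awk instead.

```shell
# Flag vmstat samples with a high page scan rate ("sr", 12th column on
# Solaris). Sample lines are invented for illustration; in practice run:
#   vmstat 15 20 | awk '$12 > 100 { print "high scan rate:", $12 }'
printf '%s\n' \
  ' 0 0 0 5242880 12582912 5 10  0   0   0 0   0' \
  ' 4 1 0 1048576  2097152 9 80 120 200 250 0 300' |
awk '$12 > 100 { print "high scan rate:", $12 }'
```

The threshold of 100 scans/second is an assumption for the sketch; tune it against what the box shows when it is known to be healthy.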

I have to reiterate that having a swap partition smaller than main memory is not a good thing. Ever. I don't have access to the install manuals for DS/EE and can't recall whether there is a recommendation in there, but it is common, if not usual, to have swap double the size of physical memory. And with a multi-CPU system and its enhanced capabilities, that recommended minimum is most likely higher.

If you do a Google search with your hardware platform and swap size as keywords, you might even get your manufacturer's recommendation in black and white instead of having to rely on hearsay :)
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Most UNIX vendors recommend swap space of 2 to 3 times physical memory.

ArndW wrote: "the way UNIX works is that it doesn't start swapping until the available memory is full"

I do not think that is valid any more, especially on Sun Solaris. I do not think one process can grow beyond 2GB on most boxes. If you look at the physical hardware of an E10000, it has several complete computers inside. The physical memory is attached to a single CPU, or a set of CPUs, in a single structure. Most Sun boxes have this design. Their goal is to make Solaris run across multiple UNIX machines and act like one big machine. Coordinating virtual memory across many separate physical memory locations on separate buses would be difficult and slow. My guess is that it swaps when it hits a physical memory boundary.

I have noticed lately, when I run top or vmstat, that swap shows as being used while memory is still available. How do you explain that? These new flavors of UNIX are doing tricks in the kernel that were not there several years back, all to make them more scalable. These configurations of UNIX are designed not to run one large process but to run lots of small ones.

The same is true of disk performance. ETL creates a few giant disk-eating processes which write sequentially to disk drives. The new disk storage arrays are not designed for this. You can get better performance with no RAID and no arrays: write to raw disk drives with no Veritas and your performance will go up.

All of these hardware solutions are designed to run more and more OLTP users. ETL and OLAP use disk storage in a completely different manner. Soon we will see hardware designed specifically for ETL; it will kill the current performance numbers.
Mamu Kim
jwhyman
Premium Member
Posts: 13
Joined: Fri Apr 09, 2004 2:18 am

Post by jwhyman »

Before Solaris 2 (SunOS 5), Solaris was notorious for its swap requirements: roughly 2 to 2.5 times physical RAM would basically need to be allocated. Any memory requirement of a forked process was allocated on swap, with more to come as any anonymous (heap) memory was used. So it was common, and possible, to run out of swap space even if you had not run out of physical memory.

Now, however, on Solaris 2.x, program text (code) and libraries are not allocated on swap; they have a perfectly good place on disk to go back to if the need arises. But anonymous memory still has to have somewhere to go.

So if your process uses lots of anonymous memory (memory that has no corresponding place on disk: heap, data, stack, shared memory), that is your swap space, and it will be allocated even if it is not used. It will still be possible to fill swap without filling memory (albeit harder).
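To put a number on one process's anonymous (swap-reserved) memory, Solaris `pmap -x <pid>` breaks an address space down by mapping, with an "Anon" column. The sketch below totals that column; the three sample mapping lines are invented, and on a real box you would pipe live `pmap -x` output into the awk instead.

```shell
# Invented sample of Solaris `pmap -x <pid>` mapping lines (real usage:
# `pmap -x <pid> | awk ...`). Column 4 ("Anon") is anonymous memory that
# must be backed by swap reservation; file-backed mappings like libc show
# "-" there because they can always be paged back in from disk.
printf '%s\n' \
  '00010000    8192   8192   8192       - rwx--    [ heap ]' \
  'FF280000     688    688      -       - r-x--  libc.so.1' \
  'FFBEC000      80     80     80       - rwx--    [ stack ]' |
awk '$4 != "-" { anon += $4 } END { print anon " KB anonymous (swap-reserved)" }'
```

Multiply a figure like this by the number of osh processes a 16-node job spawns and it becomes clear how a 5GB swap partition can be reserved to exhaustion while physical memory still looks free.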
jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

Hi,

jwhyman wrote: "this will be heap, data, stack, shared memory; this is your swap space and will be allocated even if it is not used"

At install time we had problems with the memory settings that needed to be made in uvconfig (the values coming out of shmemtest were invalid).
Could this be related?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I think the command is shmtest. If you set these values wrong, then jobs core dump. Not good. If the command itself is wrong, then I have problems as well. Are you sure this command is incorrect?
Mamu Kim