[torqueusers] How to enforce pmem requirements
Gareth.Williams at csiro.au
Fri Feb 20 05:31:46 MST 2009
We actually have very little swap (or none), so we're not in danger of thrashing, but then we are not using suspend/resume in scheduling either. We're effectively using this mechanism to match the vmem resource request/usage to the physical memory allocation.
Before the patch, maui/moab was comparing vmem with swap rather than with (swap + physical memory), and it got that calculation wrong in cases where jobs were sharing nodes.
From: David Singleton [David.Singleton at anu.edu.au]
Sent: Friday, 20 February 2009 6:35 PM
To: Williams, Gareth (HPSC, Melbourne)
Cc: moye at rice.edu; torqueusers at supercluster.org
Subject: Re: [torqueusers] How to enforce pmem requirements
My reading of your post:
"With the patch, maui/moab correctly detects a node's available vmem
as being the total physical memory plus swap space. This can then be
scheduled/allocated by requesting vmem, and a job's virtual memory
allocation can be limited on a per-process basis and periodically
measured (and action taken on overuse) on a per-job basis."
is that jobs are able to thrash away in swap under these limits.
Probably not what you want.
I guess I would say vmem has nothing much to do with swap. Huh?!
vmem (as the term is used in PBS) is to do with the virtual address
ranges of processes (VSZ) and, for better or worse, a job's vmem is the
sum of these guys. Swap is actually used by physical pages - it's
where physical pages that don't fit in memory go.
The useful thing about process VSZ is that it is an upper bound on
a process's physical memory use (resident + swapped pages). We use a
conservative vmem allocation scheme where the sum of vmems of
running jobs(*) has to fit in the physical memory of the nodes of
those jobs. AFAICT, currently, that's the only way of guaranteeing
you won't run out of memory or suffer swap thrashing:
sum job physical memory <= sum job virtual memory <= node physical memory
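The conservative scheme above amounts to a simple admission test. A minimal sketch (the node size and the running jobs' vmem values are assumed for illustration, not taken from this thread):

```shell
# Admit a candidate job only if the sum of vmem of jobs already
# running on the node, plus the candidate's vmem, fits in the
# node's physical memory. All figures in MB and assumed.
node_physmem_mb=16384          # assumed 16GB node
running_vmem_mb="4096 8192"    # vmem of jobs already on the node
candidate_vmem_mb=2048

total=$candidate_vmem_mb
for v in $running_vmem_mb; do
  total=$(( total + v ))
done

if [ "$total" -le "$node_physmem_mb" ]; then
  echo "admit: ${total} MB <= ${node_physmem_mb} MB"
else
  echo "defer: ${total} MB > ${node_physmem_mb} MB"
fi
```

Since each job's physical use is bounded by its vmem, the node can never be pushed into swap this way, at the cost of under-committing memory.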
If Linux provided a measure of process physical memory use (RSS+swap),
then that is what should be used for limiting directly, i.e. PBS mem
should be defined as (RSS+swap) instead of just RSS, and we would use
sum job physical memory (mem) <= node physical memory
Hopefully that is what we will get with cgroup memory controllers.
Or maybe that is what your patch is using now?
(*) we do a fair bit of doctoring of vmem evaluation for shared maps etc.
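For what it's worth, the cgroup memory controller does expose exactly this (RSS+swap) accounting. A sketch using the cgroup-v2 interface; the cgroup name is hypothetical, and this needs root and a cgroup-v2 mount:

```shell
# Hypothetical per-job cgroup under a cgroup-v2 hierarchy.
job=/sys/fs/cgroup/job1234
mkdir "$job"

# Cap the job's physical memory at 4GB (value is in bytes).
echo $((4 * 1024 * 1024 * 1024)) > "$job/memory.max"

# Forbid swap use entirely, so the cap is a true physical limit.
echo 0 > "$job/memory.swap.max"

# Move the current shell (and its children) into the cgroup.
echo $$ > "$job/cgroup.procs"

# The job's physical footprint (RSS + swapped pages) is then the
# sum of these two counters:
cat "$job/memory.current" "$job/memory.swap.current"
```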
Gareth.Williams at csiro.au wrote:
> Hi All,
> I've just posted on mauiusers and moabusers on scheduling using vmem. This might be a good option for you.
> The post is at:
> Gareth Williams
> CSIRO IM&T - ASC
>> -----Original Message-----
>> From: David Singleton [mailto:David.Singleton at anu.edu.au]
>> Sent: Thursday, 19 February 2009 7:45 AM
>> To: Roger Moye
>> Cc: torqueusers at supercluster.org
>> Subject: Re: [torqueusers] How to enforce pmem requirements
>> Strictly speaking, pmem limits can't always stop nodes running out of
>> memory, even if enforced.
>> a. A job can start an arbitrary number of processes, none of which
>> exceed the pmem limit.
>> b. It is conceivable for apparently reasonable pmem limits to never
>> be hit by a job that fills swap. Consider a 4-cpu node with
>> 4GB of memory. A reasonable pmem limit would apparently be
>> 1GB. However, 4 processes growing memory use at the same rate
>> will never reach that limit. They will start paging at some
>> lower value and can continue paging until the node runs out
>> of swap.
>> My other problem with pmem (and mem) limits is that they are unpredictable.
>> The same job running on the same node may run totally under the limit
>> one run and hit the limit on another run. Process physical memory
>> use depends not only on the job/process but also on the system state.
>> Sorry for not being helpful.
>> Roger Moye wrote:
>>> We have Torque/Moab running on one cluster and Torque/Maui on another.
>>> We encourage our users to use the pmem option to specify their memory
>>> requirements in their PBS batch scripts. Is there a way to get the
>>> scheduler to enforce these limits? That is, if a job attempts to exceed
>>> the pmem value we want the scheduler to kill the job just like it would
>>> if it exceeded its walltime. Currently we have a few users who have
>>> their jobs exceed their pmem value and the result is trashed nodes
>>> because the jobs have consumed too much memory.
>>> Thanks in advance for any help or advice!
Dr David Singleton ANU Supercomputer Facility
HPC Systems Manager and NCI National Facility
David.Singleton at anu.edu.au Leonard Huxley Bldg (No. 56)
Phone: +61 2 6125 4389 Australian National University
Fax: +61 2 6125 8199 Canberra, ACT, 0200, Australia