[torqueusers] memory_pressure and oom-kill.
dbeer at adaptivecomputing.com
Thu Nov 3 10:01:56 MDT 2011
----- Original Message -----
> I've been playing around with cpusets in TORQUE and stumbled across the
> memory_pressure settings you can configure to make pbs_mom take action
> against jobs growing out of their memory limits. I got very excited, to
> say the least, as this might make it possible for us to allow jobs from
> multiple users to run on the same compute node. (We currently run a
> SINGLEUSER policy in Maui and thus take a hit of around 5-10% on
> utilization. This is likely to worsen as we move towards more cores
> per node.)
>
> However, some testing revealed a very serious issue: if a job passes
> its memory_pressure limits it will be killed whether or not it is
> overallocating its memory. So if you allow jobs from multiple users to
> run on a compute node, you can get into scenarios where a well-behaving
> job from userA is killed because userB did something stupid. Not a good
> situation. It should be fairly simple to check the memory consumption
> of the job before deciding to take action.
I'm not certain if a well-behaving job would get killed or not. I think it'd be good to run some tests and see how it happens in practice, although I certainly see the possibility of a well-behaving job also getting killed.
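The check proposed above could look roughly like this: before killing anything on memory_pressure, compare each job's actual resident memory against its requested limit and only target jobs that are genuinely over. This is a minimal sketch, assuming the mom can read each job process's /proc/<pid>/status; the helper names (job_rss_kb, job_over_limit) are illustrative, not TORQUE internals.

```python
def job_rss_kb(status_text):
    """Sum the VmRSS entries from the contents of one or more
    /proc/<pid>/status files belonging to a job's processes.
    VmRSS is reported by the kernel in kB."""
    total = 0
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            total += int(line.split()[1])
    return total

def job_over_limit(status_texts, limit_kb):
    """True only if the job's combined resident memory exceeds its
    requested limit -- a well-behaving job under its limit would be
    left alone even when the cpuset's memory_pressure fires."""
    return sum(job_rss_kb(t) for t in status_texts) > limit_kb
```

In practice pbs_mom already knows the job's PIDs from the cpuset's tasks file, so the extra cost is just a handful of /proc reads at the moment memory_pressure trips.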
> One could of course rely on oom-kill to take action, but that doesn't
> come into play until all the swap is consumed. That is too late for HPC.
Other users/customers have found the OOM to come into play too late to help them.
Direct Line: 801-717-3386 | Fax: 801-717-3738
1712 S East Bay Blvd, Suite 300
Provo, UT 84606