[torqueusers] memory_pressure and oom-kill.

David Beer dbeer at adaptivecomputing.com
Thu Nov 3 10:01:56 MDT 2011



----- Original Message -----
> I've been playing around with cpusets in torque and stumbled across the
> memory_pressure settings you can configure to make pbs_mom take action
> against jobs growing out of their memory limits.  I got very excited, to
> say the least, as this might make it possible for us to allow jobs from
> multiple users to run on the same compute node.  (We currently run a
> SINGLEUSER policy in maui and thus take a hit of around 5-10% on
> utilization.  This is likely to worsen as we move towards more cores per
> node.)
>   However, some testing revealed a very serious issue:  If a job passes
> its memory_pressure limits it will be killed whether or not it is
> overallocating its memory.  So if you allow multiple jobs from multiple
> users to run on a compute node, you can get into scenarios where a
> well-behaving job from userA is killed because userB did something
> stupid.  Not a good situation.  It should be fairly simple to check the
> memory consumption of the job before deciding to take action.
> 

I'm not certain whether a well-behaving job would get killed. It would be good to run some tests and see what happens in practice, although I certainly see the possibility of a well-behaving job getting killed too.
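
To make the suggestion concrete, here is a rough sketch (not the actual pbs_mom code) of what such a check could look like: sum the resident memory of every task in the job's cpuset and only treat the job as the offender if it is actually over its own limit. The cpuset mount point and per-job path (/dev/cpuset/torque/<jobid>) are just illustrative assumptions; adjust for your setup.

/* Rough sketch only (not the actual pbs_mom code): sum the resident set
 * size of every task in a job's cpuset and compare it against the job's
 * own memory limit before treating that job as the offender.  Assumes
 * the cpuset filesystem is mounted at /dev/cpuset with one cpuset per
 * job (e.g. /dev/cpuset/torque/<jobid>); both paths are illustrative. */
#include <stdio.h>
#include <stdlib.h>

/* Sum VmRSS (in kB) over all PIDs listed in the cpuset's tasks file. */
static long job_rss_kb(const char *cpuset_path)
  {
  char  path[1024];
  char  line[256];
  long  total_kb = 0;
  FILE *tasks;

  snprintf(path, sizeof(path), "%s/tasks", cpuset_path);
  if ((tasks = fopen(path, "r")) == NULL)
    return(-1);

  while (fgets(line, sizeof(line), tasks) != NULL)
    {
    long  pid = atol(line);
    char  status_path[64];
    long  rss_kb;
    FILE *status;

    snprintf(status_path, sizeof(status_path), "/proc/%ld/status", pid);
    if ((status = fopen(status_path, "r")) == NULL)
      continue;  /* the task may have exited already */

    while (fgets(line, sizeof(line), status) != NULL)
      {
      if (sscanf(line, "VmRSS: %ld kB", &rss_kb) == 1)
        {
        total_kb += rss_kb;
        break;
        }
      }

    fclose(status);
    }

  fclose(tasks);
  return(total_kb);
  }

/* Only kill the job if it is actually over its own limit. */
static int job_over_limit(const char *cpuset_path, long limit_kb)
  {
  long rss_kb = job_rss_kb(cpuset_path);

  return(rss_kb >= 0 && rss_kb > limit_kb);
  }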

> One could of course rely on oom-kill to take action, but that doesn't
> come into play until all the swap is consumed.  That is too late for
> HPC uses, right?

Other users/customers have found that the OOM killer comes into play too late to help them.
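
One way to act earlier is to watch the per-cpuset memory_pressure value itself rather than waiting for the kernel OOM killer. A minimal sketch follows, assuming memory_pressure_enabled has been set to 1 on the root cpuset; the cpuset path and threshold are placeholders, not values TORQUE uses.

/* Sketch: read the per-cpuset memory_pressure value as an early-warning
 * signal instead of waiting for the kernel OOM killer.  The kernel only
 * maintains this value when /dev/cpuset/memory_pressure_enabled is set
 * to 1; the cpuset path and threshold here are illustrative only. */
#include <stdio.h>

/* Return the current memory_pressure reading for a job's cpuset,
 * or -1 if it cannot be read. */
static long read_memory_pressure(const char *cpuset_path)
  {
  char  path[1024];
  long  pressure = -1;
  FILE *fp;

  snprintf(path, sizeof(path), "%s/memory_pressure", cpuset_path);
  if ((fp = fopen(path, "r")) == NULL)
    return(-1);

  if (fscanf(fp, "%ld", &pressure) != 1)
    pressure = -1;

  fclose(fp);
  return(pressure);
  }

/* Example: act on the job well before swap is exhausted. */
static int job_under_pressure(const char *cpuset_path, long threshold)
  {
  long p = read_memory_pressure(cpuset_path);

  return(p >= 0 && p > threshold);
  }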

> 
> r.
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


