[torqueusers] memory_pressure and oom-kill.

Roy Dragseth roy.dragseth at cc.uit.no
Fri Nov 4 02:58:50 MDT 2011


On Thursday, November 03, 2011 17:01:56 David Beer wrote:
> ----- Original Message -----
> 
> > I've been playing around with cpusets in torque and stumbled across
> > the memory_pressure thingies you can configure to make pbs_mom take
> > action against jobs growing out of their memory limits.  I got very
> > excited, to say the least, as this might make it possible for us to
> > allow jobs from multiple users to run on the same compute node.  (We
> > currently run a SINGLEUSER policy in maui and thus take a hit of
> > around 5-10% on utilization.  This is likely to worsen as we move
> > towards more cores per node.)
> > 
> > However, some testing revealed a very serious issue: if a job passes
> > its memory_pressure limits it will be killed whether or not it is
> > actually overallocating its memory.  So if you allow jobs from
> > multiple users to run on a compute node, you can get into scenarios
> > where a well-behaving job from userA is killed because userB did
> > something stupid.  Not a good situation.  It should be fairly simple
> > to check the memory consumption of the job before deciding to take
> > action.
> 
> I'm not certain if a well-behaving job would get killed or not. I think
> it'd be good to run some tests and see how it happens in practice,
> although I certainly see the possibility of a well-behaving job also
> getting killed.
> 
> > One could of course rely on oom-kill to take action, but that
> > doesn't come into play until all the swap is consumed.  That is too
> > late for HPC uses, right?
> 
> Other users/customers have found the OOM killer to come into play too
> late to help them.

I have indeed tested the memory_pressure functionality and it behaves as I 
described.

I ran two jobs using the stress application from 
http://weather.ou.edu/~apw/projects/stress/

(we use it a lot to test for faulty DIMMs)

The compute node had 2GB RAM and 1GB swap.

First job:

  stress -m 1 --vm-bytes 1000M

Second job:

  stress -m 1 --vm-bytes 1500M

When the second job started, both jobs exceeded their memory_pressure limits
and were killed by their respective moms after the prescribed grace period.
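
For anyone who wants to reproduce this: the memory_pressure checking is
driven by the cpuset support in pbs_mom and is configured in
mom_priv/config.  Roughly something like the sketch below, assuming the
$memory_pressure_threshold / $memory_pressure_duration parameter names from
the 3.x docs; the values are only illustrative, not the ones from our test
node.

  # mom_priv/config (illustrative values only; needs a cpuset-enabled build)
  $memory_pressure_threshold 1000
  # number of mom check cycles the pressure must persist before the mom
  # acts, i.e. the grace period mentioned above
  $memory_pressure_duration 5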


I can see two immediate solutions to this problem:
 1. Check the job's RSS against the prescribed pmem or mem value and kill it
only if it has actually violated its own limit.
 2. Trigger a user-definable script and leave it to that script to take the
appropriate action (a rough sketch follows below).

Both have pros and cons, and there might be better solutions.
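
To make option 2 a bit more concrete, below is a rough sketch of the kind of
script I have in mind.  No such hook exists in pbs_mom today, and the
JOB_RSS_KB / JOB_MEM_LIMIT_KB variables are made up; the mom would have to
export them (and the job id) before invoking the script.

  #!/bin/sh
  # Hypothetical hook run by pbs_mom when a job's cpuset reports memory
  # pressure.  PBS_JOBID, JOB_RSS_KB and JOB_MEM_LIMIT_KB are assumed to be
  # exported by the mom -- they do not exist today.

  if [ "$JOB_RSS_KB" -gt "$JOB_MEM_LIMIT_KB" ]; then
      # The job really is over its own pmem/mem request, so kill it.
      logger -t memory_pressure \
        "killing $PBS_JOBID: rss ${JOB_RSS_KB}kB > limit ${JOB_MEM_LIMIT_KB}kB"
      qdel "$PBS_JOBID"
  else
      # The job is within its limits and only a victim of pressure caused by
      # someone else's job, so leave it alone and just log the event.
      logger -t memory_pressure \
        "$PBS_JOBID under pressure but within its limits, not touched"
  fi

The script could of course do something smarter than a plain qdel, e.g.
requeue the job or notify the user, which is exactly why I like the idea of
leaving the policy to a site-defined script.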

Any thoughts?

r.


