[torqueusers] memory_pressure and oom-kill.

Roy Dragseth roy.dragseth at cc.uit.no
Thu Nov 3 03:19:44 MDT 2011


I've been playing around with cpusets in torque and stumbled across the 
memory_pressure thingies you can configure to make pbs_mom take action against 
jobs growing out of their memory limits.  I got very excited, to say the least, 
as this might make it possible for us to allow jobs from multiple users to run 
on the same compute node.  (We currently run a SINGLEUSER policy in maui and 
thus take a hit of around 5-10% on the utilization.  This is likely to worsen 
as we move towards more cores per node.)
  However, some testing revealed a very serious issue:  If a job passes its 
memory_pressure limits it will be killed regardless of whether it is actually 
overallocating its memory.  So if you allow jobs from multiple users to run on 
a compute node you can get into scenarios where a well-behaved job from userA 
is killed because userB did something stupid.  Not a good situation.  It 
should be fairly simple to check the memory consumption of the job before 
deciding to take action.
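
To make the idea concrete, here is a rough sketch of the kind of check I have 
in mind.  This is not pbs_mom code; the JobMemoryState fields, the function 
name and the threshold values are all made up for illustration.  The only point 
is that action is taken when the job is both above the pressure threshold and 
using more memory than it requested.

```python
from dataclasses import dataclass


@dataclass
class JobMemoryState:
    """Snapshot of one job's memory figures (all names here are hypothetical)."""
    memory_pressure: int   # value read from the job's cpuset memory_pressure file
    resident_bytes: int    # memory the job is currently using
    requested_bytes: int   # memory the job asked for at submission


def should_take_action(job: JobMemoryState, pressure_threshold: int) -> bool:
    """Act only when the job is both under pressure and over its own request."""
    under_pressure = job.memory_pressure > pressure_threshold
    over_allocated = job.resident_bytes > job.requested_bytes
    return under_pressure and over_allocated


GIB = 1024 ** 3
# userA stays within its request but sees pressure caused by a neighbour.
user_a = JobMemoryState(memory_pressure=900, resident_bytes=2 * GIB, requested_bytes=4 * GIB)
# userB has grown well past its request.
user_b = JobMemoryState(memory_pressure=900, resident_bytes=12 * GIB, requested_bytes=4 * GIB)

print(should_take_action(user_a, pressure_threshold=500))  # False
print(should_take_action(user_b, pressure_threshold=500))  # True
```

With a check like this, userA's job would be left alone and only userB's job 
would be a candidate for killing.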

One could of course rely on oom-kill to take action, but that doesn't come 
into play until all the swap is consumed.  That is too late for HPC use, 
right?

r.

