[torqueusers] memory_pressure and oom-kill.

Roy Dragseth roy.dragseth at cc.uit.no
Thu Nov 3 03:19:44 MDT 2011

I've been playing around with cpusets in torque and stumbled across the 
memory_pressure thingies you can configure to make pbs_mom take action against 
jobs growing out of their memory limits.  I got very excited to say the least, 
as this might make it possible for us to allow jobs from multiple users to run 
on the same compute node.  (We currently run a SINGLEUSER policy in maui and 
thus take a hit of around 5-10% on the utilization.  This is likely to worsen as 
we move towards more cores per node.)
  However, some testing revealed a very serious issue:  If a job passes its 
memory_pressure limits it will be killed regardless of whether it is 
overallocating its memory or not.  So if you allow multiple jobs from multiple 
users to run on a 
compute node you can get into scenarios where a well-behaving job from userA 
is killed because userB did something stupid.  Not a good situation.  It 
should be fairly simple to check the memory consumption of the job before 
deciding to take action.
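
A minimal sketch of such a check, in Python for illustration only.  This is 
not pbs_mom code; the Job fields, the threshold parameter and the function 
name are all hypothetical, and it assumes the mom already knows each job's 
current memory use and the amount it requested.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are hypothetical stand-ins for what pbs_mom tracks per job.
    job_id: str
    owner: str
    memory_pressure: int   # value read from the job's cpuset
    mem_used_kb: int       # memory currently used by the job's processes
    mem_requested_kb: int  # memory the job asked for at submission


def should_take_action(job: Job, pressure_threshold: int) -> bool:
    """Act only on a job that is both over the pressure threshold
    and using more memory than it requested."""
    over_pressure = job.memory_pressure > pressure_threshold
    over_allocated = job.mem_used_kb > job.mem_requested_kb
    return over_pressure and over_allocated


# userA stays within its request, userB does not; both see high pressure.
job_a = Job("1.node", "userA", 500, 1_000_000, 2_000_000)
job_b = Job("2.node", "userB", 500, 5_000_000, 2_000_000)
victims = [j.job_id for j in (job_a, job_b) if should_take_action(j, 100)]
```

With this check only userB's job is selected, even though both jobs are over 
the pressure threshold.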

One could of course rely on oom-kill to take action, but that doesn't come 
into play until all the swap is consumed.  That is too late for HPC uses, 

