[torqueusers] memory_pressure and oom-kill.
Roy Dragseth
roy.dragseth at cc.uit.no
Thu Nov 3 03:19:44 MDT 2011
I've been playing around with cpusets in torque and stumbled across the
memory_pressure settings you can configure to make pbs_mom take action against
jobs growing beyond their memory limits. I got very excited, to say the least,
as this might make it possible for us to allow jobs from multiple users to run
on the same compute node. (We currently run a SINGLEUSER policy in maui and
thus take a hit of around 5-10% in utilization. This is likely to worsen as
we move towards more cores per node.)
However, some testing revealed a very serious issue: if a job passes its
memory_pressure limits it will be killed regardless of whether it is
overallocating its memory or not. So if you allow multiple jobs from multiple
users to run on a compute node you can get into scenarios where a well-behaved
job from userA is killed because userB did something stupid. Not a good
situation. It should be fairly simple to check the memory consumption of the
job before deciding to take action.
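
The check proposed above could look roughly like the following. This is only a
sketch in Python to illustrate the decision logic, not pbs_mom's actual code
(which is written in C). The names JobSample, should_kill and
pressure_threshold are hypothetical, and how the memory_pressure reading and
the job's resident memory are collected is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """One reading for a job. All field names are hypothetical."""
    job_id: str
    memory_pressure: int   # value read from the job's cpuset memory_pressure file
    resident_bytes: int    # memory the job's processes are using right now
    requested_bytes: int   # memory the job asked for at submission


def should_kill(sample: JobSample, pressure_threshold: int) -> bool:
    """Kill only if the job is under pressure AND over its own allocation."""
    under_pressure = sample.memory_pressure > pressure_threshold
    over_allocation = sample.resident_bytes > sample.requested_bytes
    return under_pressure and over_allocation


GiB = 1024 ** 3
# userA stays within its request, userB exceeds it; both see high pressure.
user_a = JobSample("1.userA", memory_pressure=500,
                   resident_bytes=2 * GiB, requested_bytes=4 * GiB)
user_b = JobSample("2.userB", memory_pressure=500,
                   resident_bytes=6 * GiB, requested_bytes=4 * GiB)
for job in (user_a, user_b):
    print(job.job_id, "kill" if should_kill(job, 100) else "keep")
```

With the extra condition, only the job that has actually grown past its own
request is a candidate for killing; the well-behaved neighbour is left alone
even though it sees the same pressure.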
One could of course rely on oom-kill to take action, but that doesn't come
into play until all the swap is consumed. That is too late for HPC uses,
right?
r.