[torqueusers] Torque not killing job exceeding memory requested

Åke Sandgren ake.sandgren at hpc2n.umu.se
Fri Jan 19 00:07:24 MST 2007


On Thu, 2007-01-18 at 15:14 -0500, Troy Baer wrote:
> On Thu, 2007-01-18 at 13:15 -0600, Laurence Dawson wrote:
> > It's running on x86 linux with a 2.4 kernel,
> > 
> > This is an example job
> > 
> > [root@vmpsched root]# qstat -f 1392706 | grep mem
> > resources_used.mem = 2040216kb
> > resources_used.vmem = 2654428kb
> > Resource_List.mem = 1500mb
> > 
> > [root@vmpsched root]# diagnose -j 1392706
> > JobID State Proc WCLimit User Opsys Class Features
> > 
> > 1392706 Running 1 2:07:00:00 yiy1 - all -
> > WARNING: job '1392706' utilizes more memory than dedicated (1992 > 1500)
> > 
> > As recommended by Seb, a couple of minutes ago I enabled the 
> > RESOURCELIMITPOLICY MEM:ALWAYS:CANCEL,
> > 
> > but so far it is still running...
> 
> Hmm...  It looks like the UMU vmem patch disabled mem= enforcement on
> the pbs_mom side.  From torque-2.1.6/src/resmom/linux/mom_mach.c:
> 
> [...]
> /* NOTE:  mem_limit no longer used with UMU patch in place */
> [...]
> /* UMU vmem patch sets RLIMIT_AS rather than RLIMIT_DATA and
> RLIMIT_STACK */
> [...]
> 
> There's a section of commented-out code after the second comment shown
> above that I *think* would re-enable mem= enforcement if you uncommented
> it and recompiled, but I'm not absolutely sure of that.

The basic problem is that memory usage is rather complex on a Linux box
(and on most other *nix systems too).

The overmem_proc and mem_sum routines check vsize, which the patch above
has already handed over to the kernel to enforce (via RLIMIT_AS).
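
For readers who have not looked at the patch, here is a rough sketch of
what handing the limit to the kernel means. This is not code from Torque
or the UMU patch; the function name limit_address_space is made up for
illustration. Once RLIMIT_AS is set on a process, the kernel itself
refuses allocations (malloc, mmap, brk) that would grow the address space
past the limit, so the mom no longer needs to poll vsize.

```c
#include <sys/resource.h>

/* Hypothetical helper, not part of Torque: lower the soft address-space
 * limit of the calling process to `bytes`, leaving the hard limit alone.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int limit_address_space(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;

    /* The soft limit may not exceed the hard limit, so clamp to it. */
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;

    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_AS, &rl);
}
```

A job launcher would typically call something like this in the child
process after fork() and before exec(), so that the limit is inherited by
the job.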

So to get Torque to enforce the (p)mem limits itself (since the Linux
kernel can't do that), mom_over_limit could be changed to check the mem
and pmem limits instead of (p)vmem, and overmem_proc and mem_sum could
check rss instead of vsize. I haven't tried this myself, since all we
really care about here is (p)vmem.
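
To make the idea concrete, here is an untested sketch of a per-process
rss check. It is not Torque code: the function names are hypothetical,
and it assumes the resident set size can be read from the second field
of /proc/<pid>/statm (counted in pages) rather than through whatever the
mom already collects for each process.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse the first two fields of a /proc/<pid>/statm line: total program
 * size and resident set size, both in pages.
 * Returns 0 on success, -1 if the line is malformed. */
int parse_statm(const char *line, unsigned long *size_pages,
                unsigned long *rss_pages)
{
    return sscanf(line, "%lu %lu", size_pages, rss_pages) == 2 ? 0 : -1;
}

/* Pure comparison: 1 if rss (in pages) exceeds limit_bytes, else 0. */
int rss_exceeds(unsigned long rss_pages, long page_size,
                unsigned long long limit_bytes)
{
    unsigned long long rss_bytes =
        (unsigned long long)rss_pages * (unsigned long long)page_size;
    return rss_bytes > limit_bytes ? 1 : 0;
}

/* Hypothetical per-process check, analogous in spirit to overmem_proc
 * but looking at rss: returns 1 if `pid` is over the limit, 0 if it is
 * within the limit, -1 if /proc could not be read or parsed. */
int rss_over_limit(pid_t pid, unsigned long long limit_bytes)
{
    char path[64], line[256];
    unsigned long size_pages, rss_pages;
    long page_size = sysconf(_SC_PAGESIZE);
    FILE *fp;

    if (page_size <= 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/%ld/statm", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof(line), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    if (parse_statm(line, &size_pages, &rss_pages) != 0)
        return -1;
    return rss_exceeds(rss_pages, page_size, limit_bytes);
}
```

With the numbers from the job above and an assumed 4 kB page size,
resources_used.mem = 2040216kb corresponds to 510054 resident pages, so
rss_exceeds(510054, 4096, 1500ULL * 1024 * 1024) reports the job as over
its 1500mb request. A job-wide mem check would sum rss over all of the
job's processes first, the way mem_sum does for vsize.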

Any thoughts on that?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


