[torqueusers] Torque not killing job exceeding memory requested

Troy Baer troy at osc.edu
Thu Jan 18 13:14:43 MST 2007


On Thu, 2007-01-18 at 13:15 -0600, Laurence Dawson wrote:
> It's running on x86 Linux with a 2.4 kernel.
> 
> This is an example job
> 
> [root at vmpsched root]# qstat -f 1392706 | grep mem
> resources_used.mem = 2040216kb
> resources_used.vmem = 2654428kb
> Resource_List.mem = 1500mb
> 
> [root at vmpsched root]# diagnose -j 1392706
> JobID State Proc WCLimit User Opsys Class Features
> 
> 1392706 Running 1 2:07:00:00 yiy1 - all -
> WARNING: job '1392706' utilizes more memory than dedicated (1992 > 1500)
> 
> As recommended by Seb, a couple of minutes ago I enabled the 
> RESOURCELIMITPOLICY MEM:ALWAYS:CANCEL,
> 
> but so far it is still running...

Hmm...  It looks like the UMU vmem patch disabled mem= enforcement on
the pbs_mom side.  From torque-2.1.6/src/resmom/linux/mom_mach.c:

[...]
/* NOTE:  mem_limit no longer used with UMU patch in place */
[...]
/* UMU vmem patch sets RLIMIT_AS rather than RLIMIT_DATA and RLIMIT_STACK */
[...]

There's a section of commented-out code after the second comment shown
above that I *think* would re-enable mem= enforcement if you uncommented
it and recompiled, but I'm not absolutely sure of that.
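
For anyone unfamiliar with the rlimits named in that comment, here is a
minimal sketch of what setting one of them does to a process. This is not
code from mom_mach.c; the helper names set_limit() and
alloc_succeeds_under_limit() are made up for illustration, and the only
real calls used are the standard setrlimit(2), fork(2), waitpid(2) and
malloc(3).

```c
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not Torque code): cap one resource, soft and
 * hard limit alike, for the calling process. */
int set_limit(int resource, rlim_t bytes)
{
    struct rlimit rl;

    rl.rlim_cur = bytes;
    rl.rlim_max = bytes;
    return setrlimit(resource, &rl);
}

/* Hypothetical helper: fork a child, apply the limit only in the child,
 * and try a single malloc() of `want` bytes there.
 * Returns 1 if the allocation succeeded, 0 if it failed, -1 on error. */
int alloc_succeeds_under_limit(int resource, rlim_t limit, size_t want)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        void *p;

        if (set_limit(resource, limit) != 0)
            _exit(2);              /* could not apply the limit */
        p = malloc(want);
        _exit(p != NULL ? 0 : 1);  /* report the result via exit status */
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    switch (WEXITSTATUS(status)) {
    case 0:  return 1;   /* allocation succeeded */
    case 1:  return 0;   /* allocation refused   */
    default: return -1;  /* setup failed         */
    }
}
```

On a typical Linux/glibc system, calling
alloc_succeeds_under_limit(RLIMIT_AS, 256 MB, 512 MB) should report a
failed allocation, while a 1 MB request under the same limit should
succeed. Note that RLIMIT_AS caps the address space (roughly what qstat
reports as vmem), so the kernel refuses the allocation up front rather
than anything killing the job later. That is a different mechanism from
pbs_mom comparing resident memory against mem= and cancelling the job.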

Garrick, any thoughts on this?

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701



