[torqueusers] Torque not killing job exceeding memory requested

Gabe Turner gabe at msi.umn.edu
Fri Jan 19 07:47:00 MST 2007


On Thu, Jan 18, 2007 at 01:15:06PM -0600, Laurence Dawson wrote:
> It's running on x86 Linux with a 2.4 kernel.
> 
> This is an example job:
> 
> [root@vmpsched root]# qstat -f 1392706 | grep mem
> resources_used.mem = 2040216kb
> resources_used.vmem = 2654428kb
> Resource_List.mem = 1500mb
> 
> [root@vmpsched root]# diagnose -j 1392706
> JobID    State    Proc  WCLimit     User  Opsys  Class  Features
> 
> 1392706  Running  1     2:07:00:00  yiy1  -      all    -
> WARNING: job '1392706' utilizes more memory than dedicated (1992 > 1500)
> 
> As Seb recommended, I enabled
> RESOURCELIMITPOLICY MEM:ALWAYS:CANCEL a couple of minutes ago,
> but so far the job is still running...
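
For reference, the figures in the quoted job agree with each other: 2040216kb is about 1992 MB, which is exactly what diagnose compares against the 1500 MB request. A minimal Python sketch of that comparison, assuming binary units (1mb = 1024kb) and handling only the b/kb/mb/gb/tb suffixes; `to_kb` and `overage_mb` are hypothetical helpers, not part of Torque or Moab:

```python
# Hypothetical helpers: compare the memory a Torque job has used against
# what it requested, using size strings as printed by `qstat -f`.
import re

# Multipliers to kilobytes; assumes binary units (1mb = 1024kb).
_UNITS_KB = {"b": 1 / 1024, "kb": 1, "mb": 1024, "gb": 1024 ** 2, "tb": 1024 ** 3}


def to_kb(value: str) -> float:
    """Convert a size string such as '2040216kb' or '1500mb' to kilobytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt]?b)\s*", value.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_KB[unit]


def overage_mb(used: str, requested: str) -> float:
    """Return how many MB the job is over its request (negative if under)."""
    return (to_kb(used) - to_kb(requested)) / 1024


if __name__ == "__main__":
    used, requested = "2040216kb", "1500mb"  # resources_used.mem, Resource_List.mem
    print(f"used {to_kb(used) / 1024:.0f} MB, requested {to_kb(requested) / 1024:.0f} MB")
    print(f"over by {overage_mb(used, requested):.0f} MB")
```

Run as a script, this prints "used 1992 MB, requested 1500 MB" and "over by 492 MB", matching the "(1992 > 1500)" in the diagnose warning.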

I'm using the following with Moab 5.0.0, Torque 2.1.6, and a Linux 2.6
kernel, and it works flawlessly:

RESOURCELIMITPOLICY     MEM:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:00:30:00
RESOURCELIMITMULTIPLIER JOBMEM:1.10

This notifies the user as soon as the job exceeds its memory request,
then cancels the job if the violation lasts 30 minutes.  The
RESOURCELIMITMULTIPLIER setting allows a 10% overrun.
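
To make the timing concrete, here is a minimal Python model of the behavior described above. It assumes the soft limit is the requested memory (notify as soon as usage exceeds it) and the hard limit is the request times the 1.10 multiplier (cancel once usage has stayed above it for the 00:30:00 violation time). `MemPolicy` and `actions` are hypothetical names; this sketches the policy logic under those assumptions and is not Moab's implementation.

```python
# Hypothetical model of the policy described above; not Moab's implementation.
# Assumptions: soft limit = requested memory (NOTIFY), hard limit =
# requested * multiplier (CANCEL after a sustained violation).
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MemPolicy:
    requested_mb: float
    multiplier: float = 1.10        # RESOURCELIMITMULTIPLIER JOBMEM:1.10
    grace_seconds: int = 30 * 60    # violation time 00:30:00

    def actions(self, samples: List[Tuple[int, float]]) -> List[Tuple[int, str]]:
        """Map time-ordered (seconds, used_mb) samples to (seconds, action) events."""
        hard_limit = self.requested_mb * self.multiplier
        events: List[Tuple[int, str]] = []
        notified = False
        violation_start: Optional[int] = None
        for t, used in samples:
            # Soft limit: notify once, as soon as usage exceeds the request.
            if used > self.requested_mb and not notified:
                events.append((t, "NOTIFY"))
                notified = True
            # Hard limit: cancel only if the violation has lasted the full grace period.
            if used > hard_limit:
                if violation_start is None:
                    violation_start = t
                if t - violation_start >= self.grace_seconds:
                    events.append((t, "CANCEL"))
                    break
            else:
                violation_start = None  # back under the hard limit: reset the clock
        return events


if __name__ == "__main__":
    policy = MemPolicy(requested_mb=1500)
    # The quoted job, sampled every 10 minutes at roughly 1992 MB.
    samples = [(t, 1992.0) for t in range(0, 3600, 600)]
    print(policy.actions(samples))  # [(0, 'NOTIFY'), (1800, 'CANCEL')]
```

In this sketch a job sitting between 1500 MB and 1650 MB is only ever notified, and a job that drops back under the hard limit restarts its 30-minute clock.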

HTH,

Gabe
-- 
Gabe Turner                                             gabe at msi.umn.edu
UNIX System Administrator,
University of Minnesota
Supercomputing Institute                          http://www.msi.umn.edu

