[torqueusers] Torque not killing job exceeding memory requested

Laurence Dawson larry.dawson at vanderbilt.edu
Fri Jan 19 14:47:07 MST 2007


Åke Sandgren wrote:
> On Thu, 2007-01-18 at 15:14 -0500, Troy Baer wrote:
>   
>> On Thu, 2007-01-18 at 13:15 -0600, Laurence Dawson wrote:
>>     
>>> It's running on x86 linux with a 2.4 kernel,
>>>
>>> This is an example job
>>>
>>> [root@vmpsched root]# qstat -f 1392706 | grep mem
>>> resources_used.mem = 2040216kb
>>> resources_used.vmem = 2654428kb
>>> Resource_List.mem = 1500mb
>>>
>>> [root@vmpsched root]# diagnose -j 1392706
>>> JobID State Proc WCLimit User Opsys Class Features
>>>
>>> 1392706 Running 1 2:07:00:00 yiy1 - all -
>>> WARNING: job '1392706' utilizes more memory than dedicated (1992 > 1500)
>>     
>>> As recommended by Seb, a couple of minutes ago I enabled the 
>>> RESOURCELIMITPOLICY MEM:ALWAYS:CANCEL,
>>>
>>> but so far it is still running...
>>>       
>> Hmm...  It looks like the UMU vmem patch disabled mem= enforcement on
>> the pbs_mom side.  From torque-2.1.6/src/resmom/linux/mom_mach.c:
>>
>> [...]
>> /* NOTE:  mem_limit no longer used with UMU patch in place */
>> [...]
>> /* UMU vmem patch sets RLIMIT_AS rather than RLIMIT_DATA and
>> RLIMIT_STACK */
>> [...]
>>
>> There's a section of commented-out code after the second comment shown
>> above that I *think* would re-enable mem= enforcement if you uncommented
>> it and recompiled, but I'm not absolutely sure of that.
>>     
>
> The basic problem is that memory usage is rather complex on a linux box
> (and most other *nix too).
>
> The overmem_proc and mem_sum routines check vsize which the patch above
> has already turned over to the kernel (RLIMIT_AS).
>
> So to get torque to take care of the (p)mem limits itself (since linux
> kernel can't do that) mom_over_limit could be changed to check for mem
> and pmem limits instead of (p)vmem and overmem_proc and mem_sum could
> check rss instead. I haven't tried doing this myself since all we really
> care about here is (p)vmem.
>
> Any thoughts on that?
>
>   
Did this problem start with a particular version? The email Gabe Turner
sent indicates it is not happening in 2.1.6 (at least for him). The
comment Troy quoted is not identical to the one in the version of Torque
I am running (2.1.0p0), but there is a section of commented-out code
that looks like this:

      /* UMU vmem patch sets RLIMIT_AS rather than RLIMIT_DATA and RLIMIT_STACK */
 
      /*
      reslim.rlim_cur = reslim.rlim_max = mem_limit;
 
      if (setrlimit(RLIMIT_DATA,&reslim) < 0)
        {
        return(error("RLIMIT_DATA",PBSE_SYSTEM));
        }
 
      if (setrlimit(RLIMIT_STACK,&reslim) < 0)
        {
        return(error("RLIMIT_STACK",PBSE_SYSTEM));
        }
      */
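
For what it's worth, the rss-based check suggested above (compare resident
memory, not vsize, against the mem limit) could be sketched roughly as
follows. This is only an illustration and not Torque code: the helper names
parse_vmrss_kb, read_rss_kb and rss_over_limit are made up, and I am assuming
the limit is given in mb (as in Resource_List.mem = 1500mb) and that rss is
read from the VmRSS line of /proc/<pid>/status.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper, not part of Torque: parse one line of
 * /proc/<pid>/status. Returns the VmRSS value in kB, or -1 if the
 * line is not a well-formed VmRSS line. */
long parse_vmrss_kb(const char *line)
{
    long kb;

    if (strncmp(line, "VmRSS:", 6) != 0)
        return -1;
    /* %ld skips the tabs/spaces that follow the colon */
    if (sscanf(line + 6, "%ld", &kb) != 1)
        return -1;
    return kb;
}

/* Hypothetical helper: resident set size of one process in kB,
 * or -1 if it cannot be determined (no such pid, no VmRSS line). */
long read_rss_kb(pid_t pid)
{
    char path[64];
    char line[256];
    long kb = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        kb = parse_vmrss_kb(line);
        if (kb >= 0)
            break;              /* found the VmRSS line */
    }
    fclose(fp);
    return kb;
}

/* Returns 1 if rss_kb exceeds a limit given in mb, else 0. */
int rss_over_limit(long rss_kb, long limit_mb)
{
    assert(limit_mb >= 0);      /* a negative limit is a caller bug */
    return rss_kb > limit_mb * 1024L;
}
```

With the numbers from the job above, rss_over_limit(2040216, 1500) reports
the job as over its limit, which matches the diagnose warning (1992 > 1500).
A real check would also have to sum rss over every process in the job, which
this sketch does not attempt.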


