[torqueusers] job exceeds memory limit without being killed

Anton Starikov ant.starikov at gmail.com
Thu Mar 18 09:51:08 MDT 2010


Patch to fix this bug is attached.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: mem_limit_kill.patch
Type: application/octet-stream
Size: 696 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20100318/315367b2/attachment.obj 
-------------- next part --------------

On Mar 18, 2010, at 4:12 PM, Anton Starikov wrote:

> OK, I've found a bug.
> 
> Normally the "mem" limit is checked in job_over_limit(). But if only one node is assigned to the job (which is my case: 1 node, 16 processes), it just asks mom_over_limit() to do the check and returns.
> And mom_over_limit() doesn't check the "mem" limit, for obvious reasons.
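
The control flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual TORQUE source and not the attached patch: the struct `job_sketch`, its fields, and the helpers `mom_limits_exceeded()`, `mem_limit_exceeded()`, `job_over_limit_buggy()` and `job_over_limit_fixed()` are all hypothetical names, and the "fixed" variant shows just one possible way to close the gap.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified job record (NOT TORQUE's real job struct). */
struct job_sketch {
    int           node_count;    /* nodes assigned to the job          */
    unsigned long mem_limit_kb;  /* job-wide "mem" limit, 0 = no limit */
    unsigned long mem_used_kb;   /* memory currently used by the job   */
    unsigned long cput_limit_s;  /* some other limit, 0 = no limit     */
    unsigned long cput_used_s;
};

/* Stand-in for mom_over_limit(): per the thread, it does not look at "mem". */
static bool mom_limits_exceeded(const struct job_sketch *j)
{
    return j->cput_limit_s != 0 && j->cput_used_s > j->cput_limit_s;
}

/* The job-wide "mem" check that the thread says lives in job_over_limit(). */
static bool mem_limit_exceeded(const struct job_sketch *j)
{
    return j->mem_limit_kb != 0 && j->mem_used_kb > j->mem_limit_kb;
}

/* Assumed shape of the bug: a single-node job returns before "mem" is checked. */
static bool job_over_limit_buggy(const struct job_sketch *j)
{
    if (j->node_count == 1)
        return mom_limits_exceeded(j);  /* early exit, "mem" never examined */

    return mom_limits_exceeded(j) || mem_limit_exceeded(j);
}

/* One possible fix (an assumption): check "mem" regardless of node count. */
static bool job_over_limit_fixed(const struct job_sketch *j)
{
    return mom_limits_exceeded(j) || mem_limit_exceeded(j);
}
```

With a single-node job that uses more memory than its "mem" limit, the buggy variant reports no violation, so nothing triggers a kill, while the fixed variant reports one.
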
> 
> 
> On Mar 18, 2010, at 3:35 PM, Anton Starikov wrote:
> 
>> The problem here, if I understand correctly, is that MAUI gathers this information once per scheduling interval, which is normally considerably longer than the polling interval of PBS_MOM. And PBS_MOM has to kill the job within its polling interval.
>> 
>> 
>> On Mar 18, 2010, at 3:18 PM, Sabuj Pattanayek wrote:
>> 
>>> pbs_mom just reports it to your scheduler. Then the scheduler (maui in
>>> my case) has to cancel the job, and then pbs_mom kills it. That doesn't
>>> work in my case, even though maui says it's canceling the job.
>>> 
>>> On Thu, Mar 18, 2010 at 9:15 AM, Anton Starikov <ant.starikov at gmail.com> wrote:
>>>> Actually, setting this policy in MAUI kills jobs in my case. But I think PBS_MOM has to deal with these limits itself, isn't that the case?
>> 
> 
