[torqueusers] job exceed memory limit without been killed
chindw at wfu.edu
Wed Oct 13 09:10:48 MDT 2010
I am just looking through the source for torque-2.5.2, and it does not
seem as if this patch made it in.
torque-2.5.2/src/resmom/linux/mom_mach.c mom_over_limit() only checks
these resources: cput, pcput, vmem, pvmem, walltime.
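To make the reported control flow concrete, here is a minimal C sketch of the behaviour described in the quoted messages below. It is not the TORQUE source and not the attached patch: the struct, its fields, and every function name ending in _sketch are hypothetical, and only the vmem check stands in for the resources that mom_over_limit() covers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a job's limits and usage (kB).
   A limit of 0 means "not requested". */
struct job_sketch {
    int    node_count;     /* nodes assigned to the job */
    size_t mem_limit_kb;   /* "mem" limit */
    size_t mem_used_kb;
    size_t vmem_limit_kb;  /* "vmem" limit */
    size_t vmem_used_kb;
};

/* Stand-in for the per-MOM check. Per the thread it covers cput, pcput,
   vmem, pvmem and walltime; only vmem is modelled here. "mem" is absent. */
bool mom_over_limit_sketch(const struct job_sketch *j)
{
    return j->vmem_limit_kb != 0 && j->vmem_used_kb > j->vmem_limit_kb;
}

/* Stand-in for the "mem" comparison that the job-wide check performs. */
bool mem_over_limit_sketch(const struct job_sketch *j)
{
    return j->mem_limit_kb != 0 && j->mem_used_kb > j->mem_limit_kb;
}

/* Behaviour as reported: a single-node job is handed to the per-MOM
   check and the function returns, so "mem" is never examined. */
bool job_over_limit_reported_sketch(const struct job_sketch *j)
{
    if (j->node_count == 1)
        return mom_over_limit_sketch(j);
    return mem_over_limit_sketch(j) || mom_over_limit_sketch(j);
}

/* One conceivable remedy (an assumption, not necessarily what the
   patch does): test "mem" regardless of the node count. */
bool job_over_limit_with_mem_sketch(const struct job_sketch *j)
{
    return mem_over_limit_sketch(j) || mom_over_limit_sketch(j);
}
```

Under these assumptions, a one-node job using 2000 kB against a 1000 kB "mem" limit passes the reported check but is flagged by the variant that always tests "mem".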
David Chin, Ph.D.
chindw at wfu.edu High Performance Computing Systems Analyst
Office: 336-758-2964 Wake Forest University
Mobile: 336-608-0793 Winston-Salem, NC
Email-to-txt: 3366080793 at mms.att.net
Google Talk: chindw at wfu.edu
On Thu, Mar 18, 2010 at 11:51, Anton Starikov <ant.starikov at gmail.com> wrote:
> Patch to fix this bug is attached.
> On Mar 18, 2010, at 4:12 PM, Anton Starikov wrote:
>> OK, I've found a bug.
>> Normally the "mem" limit is checked in job_over_limit(). But if only one node is assigned to the job (which is my case: 1 node, 16 processes), it just calls mom_over_limit() and returns.
>> And mom_over_limit() doesn't check the "mem" limit, for obvious reasons.
>> On Mar 18, 2010, at 3:35 PM, Anton Starikov wrote:
>>> The problem here, if I understand correctly, is that MAUI gathers this information once per scheduling interval, which is normally significantly longer than the polling interval of PBS_MOM. And PBS_MOM has to kill the job within the polling interval.
>>> On Mar 18, 2010, at 3:18 PM, Sabuj Pattanayek wrote:
>>>> pbs_mom just reports it to your scheduler. Then the scheduler (maui in
>>>> my case) has to cancel the job, then pbs_mom kills it. Which doesn't work
>>>> in my case even though maui says it's canceling the job.
>>>> On Thu, Mar 18, 2010 at 9:15 AM, Anton Starikov <ant.starikov at gmail.com> wrote:
>>>>> Actually, setting this policy in MAUI kills jobs in my case. But I think PBS_MOM has to deal with these limits itself, doesn't it?
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org