[Mauiusers] Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque

Jim Kusznir jkusznir at gmail.com
Mon Jun 2 10:13:05 MDT 2008


I did turn off resource enforcement (cancel), and the jobs are running
properly now.

The numbers below on load are being multiplied by 100.  I personally
observed the "372" was a node load of 3.72 according to w/top/etc.
What bothers me is that maui believes the job is only entitled to 100
(1.00, or a single CPU).  It definitely scheduled the job on the
requested 4 CPUs, and the job was submitted with both (on separate
occasions) nodes=4:ppn=1 and nodes=1:ppn=4, both with identical
results.
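
For anyone comparing these log values against w/top, here is a minimal
sketch of the conversion. It assumes the factor of 100 observed above
holds for both numbers in the JOBRESVIOLATION line; the function name
parse_proc_violation is made up for illustration and is not part of
Maui.

```python
import re

# Matches the "(used > limit)" pair in a Maui JOBRESVIOLATION log line.
# Both values are assumed to be CPU counts scaled by 100, per the
# observation in this thread (372 in the log was a load of 3.72).
PROC_VIOLATION = re.compile(r"PROC resource limit \((\d+) > (\d+)\)")


def parse_proc_violation(line):
    """Return (used_cpus, allowed_cpus) as floats, or None if no match."""
    match = PROC_VIOLATION.search(line)
    if match is None:
        return None
    used, limit = (int(group) / 100 for group in match.groups())
    return used, limit


# Log line taken from the quoted message, joined onto one line.
line = ("MSysRegEvent(JOBRESVIOLATION:  job '2120' in state 'Running' has "
        "exceeded PROC resource limit (372 > 100) (action CANCEL will be taken)")
print(parse_proc_violation(line))
```

Read this way, the job was using about 3.72 CPUs against an allowance
of 1.00, which matches the "3.72 > 1.00" INFO line in the same log.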

I don't recall ever setting the "Resource_List.ncpus=1", and I didn't
find that in maui.cfg; is there somewhere else I should be looking for
that?
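
One way to see what a job actually requested is to look at the
Resource_List.* attributes in "qstat -f" output. The sketch below
pulls those attributes out of such text. The sample text is a made-up
illustration, not output from this cluster, and the helper name
resource_list is hypothetical. If ncpus shows up there without having
been requested at submit time, one possible source (an assumption, not
verified here) is a server or queue default in Torque rather than
anything in maui.cfg.

```python
import re

# Matches lines such as "    Resource_List.ncpus = 1".
RESOURCE_LINE = re.compile(r"^\s*Resource_List\.(\S+)\s*=\s*(.+?)\s*$")


def resource_list(qstat_text):
    """Collect Resource_List.* attributes from 'qstat -f' style text."""
    found = {}
    for line in qstat_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


# Hypothetical sample; real output has many more attributes.
sample = """Job Id: 2120.example
    Job_Name = test
    Resource_List.ncpus = 1
    Resource_List.nodes = 1:ppn=4
"""
print(resource_list(sample))
```

In practice the text would come from running "qstat -f" on the job in
question and passing its output to the function.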

Thanks everyone for your help!

--Jim

On Thu, May 29, 2008 at 12:08 PM, Jan Ploski <Jan.Ploski at offis.de> wrote:
> Jim Kusznir wrote:
>>
>> I have verified that maui is killing the job.  I actually ran into
>> this with another user all of a sudden.  I don't know why it's only
>> affecting a few currently.  Here's the maui log extract for a current
>> run of this user's program:
>>
> ...
>>
>> maui.log:05/29 09:27:21 INFO:     job 2120 exceeds requested proc
>> limit (3.72 > 1.00)
>> maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION:  job '2120' in
>> state 'Running' has exceeded PROC resource limit (372 > 100) (action
>> CANCEL will be taken)  job start time: Thu May 29 09:26:19
>
> ...
>
> Here is a little theory that I think fits your present observations:
>
> 1. You have "Resource_List.ncpus = 1" (I think this is what Maui calls the
> PROC resource limit.)
> 2. You also have the Maui configuration parameter RESOURCELIMITPOLICY set to
> CANCEL.
> 3. The job's executable starts multiple threads or subprocesses (perhaps
> instead of distributing them to all the remaining nodes?)
>
> Therefore Maui shoots it down, having noticed that it uses more than it is
> supposed to. It would help towards a solution if you could verify whether
> these points are true (1 => qstat -f, 2 => view config, 3 => use top or ps).
>
> Regards,
> Jan Ploski
>

