[Mauiusers] Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque

Jan Ploski Jan.Ploski at offis.de
Thu May 29 13:08:42 MDT 2008


Jim Kusznir wrote:
> I have verified that maui is killing the job.  I actually ran into
> this with another user all of a sudden.  I don't know why it's only
> affecting a few currently.  Here's the maui log extract for a current
> run of this user's program:
>
...
> maui.log:05/29 09:27:21 INFO:     job 2120 exceeds requested proc
> limit (3.72 > 1.00)
> maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION:  job '2120' in
> state 'Running' has exceeded PROC resource limit (372 > 100) (action
> CANCEL will be taken)  job start time: Thu May 29 09:26:19
...

Here is a little theory that I think fits your present observations:

1. You have "Resource_List.ncpus = 1" (I think this is what Maui calls 
the PROC resource limit.)
2. You also have the Maui configuration parameter RESOURCELIMITPOLICY 
set to CANCEL.
3. The job's executable starts multiple threads or subprocesses (perhaps 
instead of distributing them to all the remaining nodes?)

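Maui therefore kills the job once it notices that the job is using more
processors than it requested (3.72 against a limit of 1.00 in your log).
It would help towards a solution if you could confirm each point: for 1,
check the output of qstat -f for the job; for 2, look at the Maui
configuration; for 3, watch the job with top or ps on the node where it
runs.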
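
The three checks could be scripted roughly as follows. This is only a sketch: the job ID 2120 is taken from the log above, while the config path /usr/local/maui/maui.cfg and the user name "someuser" are assumptions that will differ per site, and the ps options assume Linux procps.

```shell
#!/bin/sh
# Sketch only: the default config path and the user name in the usage
# notes are assumptions, adjust them for your installation.

# 1. Show what the job requested (ncpus, nodes, ...) according to Torque.
check_request() {
    qstat -f "$1" | grep -i 'Resource_List'
}

# 2. Show the RESOURCELIMITPOLICY setting in the Maui config file.
#    The path argument is optional; the default is an assumed location.
check_policy() {
    grep -i '^RESOURCELIMITPOLICY' "${1:-/usr/local/maui/maui.cfg}"
}

# 3. List a user's processes on this node with thread count (nlwp)
#    and CPU usage (pcpu). Run it on the node where the job executes.
check_threads() {
    ps -u "$1" -o pid,nlwp,pcpu,comm
}

# Usage (not run automatically):
#   check_request 2120
#   check_policy
#   check_threads someuser
```

If the first check shows ncpus = 1 while the third shows several busy processes or a thread count above 1 for the job's executable, that would be consistent with the theory above.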

Regards,
Jan Ploski

