[Mauiusers] RESERVATIONDEPTH not working as I expected

Jim Lawson jtl+supercluster at uvm.edu
Mon Apr 14 13:15:30 MDT 2008


Hello mauiusers,

At UVM's VACC we are running maui 3.2.6p19, torque 2.1.8.  Things had 
been working OK until I started trying to tweak how the scheduler 
works... :-)

The problem I am trying to solve is backfill starvation.  The 
highest-priority job, typically a long-running job needing lots of 
processors, does OK, but the other big jobs have to wait a long 
time, often days, to reach that first position where they get a 
reservation.  Meanwhile the cluster is busy with lots of tiny jobs.

So, to get more of the larger jobs running sooner, I set 
RESERVATIONDEPTH to 2, thinking that it would then make reservations for 
the 2 highest priority jobs.

However, more than 2 reservations for jobs in I state are typically 
created.  The top 2 jobs get a reservation, plus 2 (or more!) for some 
of the other, lower-priority jobs.  All the jobs have the same QOS 
(DEFAULT), so I don't see how RESERVATIONQOSLIST would apply. 

What's worse, something seems to be wrong with the reservations made... 
Often, a lower-priority job's reservation comes due, but maui doesn't 
start the job.  Then the running jobs start to drain, because maui isn't 
starting any jobs at all.  It seems to just get "stuck".

I can get past the problem by running "runjob" to kick the job into 
Running state, but then it's usually only a few hours before it jams 
up again.

I am also noticing ALERTs showing up in my logs that may (?) be related 
to this:

> 04/14 14:32:38 ALERT:    node 'node028.cluster' sync from expected 
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT:    node 'node029.cluster' sync from expected 
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT:    node 'node030.cluster' sync from expected 
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT:    node 'node031.cluster' sync from expected 
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT:    node 'node032.cluster' sync from expected 
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT:    node 'node042.cluster' sync from expected 
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
> 04/14 14:32:38 ALERT:    node 'node111.cluster' sync from expected 
> state 'Idle' to state 'Running' at Mon Apr 14 14:32:38
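
To see whether these alerts cluster on particular nodes or around the 
times the scheduler stalls, they can be tallied per node.  The following 
is a minimal sketch only: it assumes the log lines have exactly the shape 
quoted above, and the function name count_sync_alerts is made up for 
illustration, not part of maui.

```python
import re
from collections import Counter

# Matches the ALERT shape quoted above; \s+ tolerates the run of
# spaces after "ALERT:" and any re-joined line wraps.
SYNC_ALERT = re.compile(
    r"ALERT:\s+node '(?P<node>[^']+)' sync from expected\s+"
    r"state '(?P<expected>[^']+)' to state '(?P<actual>[^']+)'"
)


def count_sync_alerts(log_text):
    """Return a Counter keyed by (node, expected_state, actual_state)."""
    # Drop mail-quote prefixes ("> ") and collapse all whitespace so that
    # excerpts wrapped by a mail client parse the same as raw log lines.
    unquoted = re.sub(r"^>\s?", "", log_text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    return Counter(
        (m["node"], m["expected"], m["actual"])
        for m in SYNC_ALERT.finditer(flat)
    )
```

Calling count_sync_alerts() on the contents of the maui log and printing 
.most_common() on the result lists the nodes with the most state 
mismatches first.
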

For those willing to take a look, config and log file dumps are available at

http://www.uvm.edu/~jtl/mauiprob/

Thanks for any assistance that can be provided.

-- 
Jim Lawson
Systems Architecture & Administration
Enterprise Technology Services
University of Vermont
Burlington, VT USA 



