[Mauiusers] Problem with Standing Reservations when having a large number of jobs running

Michel Jouvin jouvin at lal.in2p3.fr
Thu May 29 14:14:08 MDT 2008


I sent a message to the list a month ago about my problems using standing 
reservations on a cluster with 200 machines / 1500 cores, but it remained 
unanswered.

We are running :


rebuilt with some non-default hard limits, in particular for reservations 
and standing reservations:

Parameter      : Default Setting : Current Setting :
MAX_MCLASS     : 16              : 64              :
MMAX_JOB       : 4096            : 32768           :
MAX_MJOB_TRACE : 4096            : 32768           :
MAX_MRES       : 1024            : 8192            :
MMAX_SRES      : 128             : 1024            :
MMAX_NODE      : 5129            : 5120            :

Our configuration included 1 standing reservation per machine (around 
250 in total). As long as only a small number of jobs are running, 
everything is ok: diagnose -r displays all the active reservations and 
standing reservations.

Once the number of running jobs (almost all our jobs use 1 core) reaches a 
threshold that we have not been able to determine precisely (between 900 
and 1200; it doesn't seem to be 1024!), diagnose -r is no longer able to 
list all reservations and its output ends with a message like:

NOTE:  list truncated

Active Reserved Processors: 1235
WARNING:  reservation table is corrupt:  active procs reserved does not 
equal active procs detected (1235 != 2097)

where 2097 is slightly less than the number of running jobs (whatever the 
actual number, it is always close to the number of running jobs minus 20). 
When this problem begins, the standing reservations are no longer there. As 
a consequence, the Torque PROCS normally reserved by standing reservations 
appear free and jobs are scheduled on these PROCS, leading to an unexpected 
load on the worker nodes. Unfortunately there is no message in the MAUI log 
file... but the impact on scheduling shows this is not only a diagnose 
problem.
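
For anyone who wants to detect this condition automatically, here is a 
minimal Python sketch that scans the text printed by diagnose -r for the two 
symptoms quoted above. The function name and the returned dictionary are 
only illustrative, and the patterns assume the messages are worded exactly 
as in the excerpt.

```python
import re

# Pattern built from the warning text quoted above; the two numbers are
# "active procs reserved" and "active procs detected".
_MISMATCH = re.compile(
    r"active procs reserved does not equal active procs detected "
    r"\((\d+) != (\d+)\)"
)


def check_diagnose_output(text):
    """Report whether `diagnose -r` output shows truncation or a mismatch."""
    # Collapse all whitespace so a message wrapped over two lines, or
    # containing double spaces, still matches.
    flat = " ".join(text.split())
    truncated = "NOTE: list truncated" in flat
    match = _MISMATCH.search(flat)
    mismatch = (int(match.group(1)), int(match.group(2))) if match else None
    return {"truncated": truncated, "mismatch": mismatch}
```

The text could be obtained with something like 
subprocess.run(["diagnose", "-r"], capture_output=True, text=True).stdout, 
assuming diagnose is on the PATH of the user running the check.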

Although it is not clear whether the problem is related, under these high 
load conditions MAUI is crashing very frequently (we have a cron job that 
restarts it every 5 minutes if it is no longer there).
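
As an illustration of that kind of watchdog, here is a rough Python sketch 
that could be called from cron. It is not our actual script: the process 
name "maui" and the start command /etc/init.d/maui are assumptions that 
depend on the installation, and pgrep must be available on the host.

```python
import subprocess


def is_running(name="maui"):
    """Return True if a process with exactly this name exists."""
    # `pgrep -x` exits with status 0 when at least one process matches.
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def ensure_running(check=is_running,
                   start_cmd=("/etc/init.d/maui", "start"),  # assumed path
                   run=subprocess.call):
    """Start the scheduler if it is absent; return True if a start was tried."""
    if check():
        return False
    run(list(start_cmd))
    return True


if __name__ == "__main__":
    ensure_running()
```

The check and run arguments are injectable only so that the logic can be 
exercised without touching a real scheduler.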

Thanks in advance for any help, hint or troubleshooting advice. BTW, is 
there any more recent version of MAUI than the one we are running? I have 
not found anything on the clusterresources.com web site, but there may be a 
CVS or SVN repository from which to download more recent snapshots.



