[Mauiusers] MAUI reservation table corrupted

Michel Jouvin jouvin at lal.in2p3.fr
Sat Apr 26 10:37:00 MDT 2008


Hi,

I am having trouble with MAUI after adding new nodes and reaching a higher
number of running jobs. In my configuration, each node is declared in Torque
with a number of procs equal to 2 * number of CPUs, and half of these procs
are put in a standing reservation (SR) attached to the node. A typical SR
configuration (there is in fact one per node) is:

SRCFG[sdj_0] HOSTLIST=grid33.lal.in2p3.fr
SRCFG[sdj_0] PERIOD=INFINITY
SRCFG[sdj_0] ACCESS=DEDICATED
SRCFG[sdj_0] PRIORITY=10
SRCFG[sdj_0] TASKCOUNT=1
SRCFG[sdj_0] RESOURCES=PROCS:4
SRCFG[sdj_0] CLASSLIST=dteam,ops,sdj
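
For illustration, the per-node pattern above could be generated with a short Python sketch. The helper name sr_stanza and the torque_np argument are hypothetical, and torque_np=8 for grid33 is only inferred from PROCS:4 being half of the Torque proc count; it is not stated anywhere in the configuration itself.

```python
def sr_stanza(index, host, torque_np, classes=("dteam", "ops", "sdj")):
    """Build the SRCFG lines for one node (hypothetical helper)."""
    name = f"sdj_{index}"
    # Half of the procs declared in Torque go into the standing reservation.
    reserved = torque_np // 2
    settings = [
        f"HOSTLIST={host}",
        "PERIOD=INFINITY",
        "ACCESS=DEDICATED",
        "PRIORITY=10",
        "TASKCOUNT=1",
        f"RESOURCES=PROCS:{reserved}",
        "CLASSLIST=" + ",".join(classes),
    ]
    return [f"SRCFG[{name}] {s}" for s in settings]


if __name__ == "__main__":
    # torque_np=8 is an assumption inferred from RESOURCES=PROCS:4.
    print("\n".join(sr_stanza(0, "grid33.lal.in2p3.fr", torque_np=8)))
```

Run as a script, this prints one stanza of the same shape as the sdj_0 example.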

The current total number of nodes and number of jobs, as reported by
'diagnose -n' and 'showq -r', are:

Total Nodes: 170  (Active: 167  Idle: 1  Down: 2)

1267 Jobs    1267 of  2552 Processors Active (49.65%)


In this configuration, 'diagnose -r' lists some of the reservations made
for active jobs, and then the output is truncated with this error:

NOTE:  list truncated

Active Reserved Processors: 1247
WARNING:  reservation table is corrupt:  active procs reserved does not
equal active procs detected (1247 != 1267)
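
The mismatch can be cross-checked outside Maui with a rough Python sketch that compares the two summary lines. The function name check_reservations is hypothetical, and the regular expressions only assume the line formats quoted in this message; other Maui versions may print them differently.

```python
import re

# Sample text copied from the output quoted in this message.
DIAGNOSE_SAMPLE = "NOTE:  list truncated\n\nActive Reserved Processors: 1247\n"
SHOWQ_SAMPLE = "1267 Jobs    1267 of  2552 Processors Active (49.65%)\n"


def check_reservations(diagnose_r_text, showq_r_text):
    """Compare reserved procs ('diagnose -r') with active procs ('showq -r')."""
    reserved = re.search(r"Active Reserved Processors:\s*(\d+)", diagnose_r_text)
    active = re.search(r"(\d+)\s+of\s+(\d+)\s+Processors Active", showq_r_text)
    if reserved is None or active is None:
        raise ValueError("summary line not found in the given output")
    reserved_procs = int(reserved.group(1))
    active_procs = int(active.group(1))
    return {
        "truncated": "list truncated" in diagnose_r_text,
        "reserved": reserved_procs,
        "active": active_procs,
        "missing": active_procs - reserved_procs,
    }


if __name__ == "__main__":
    print(check_reservations(DIAGNOSE_SAMPLE, SHOWQ_SAMPLE))
```

With the figures above, 20 active processors have no reservation listed.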

In maui.log (LOGLEVEL 2), I cannot find any error related to this using
grep -E 'WARN|ERROR|ALERT' /var/log/maui.log. The only thing I see, though
it is not clear whether it is related, is that some jobs have entries like:

04/19 14:36:07 INFO:     active PBS job 33633 has been removed from the
queue.  assuming successful completion
04/19 14:36:07 ALERT:    job '             33633' has invalid system queue
time (SQ: 1208607298 > ST: 1208567570)
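
To see how far apart the two timestamps are, such ALERT entries can be pulled out with a small Python sketch. The function name parse_alert is hypothetical, the pattern only assumes the message format quoted above, and ST is presumably the job start time. The \s+ in the pattern tolerates the line wrap introduced by the mail client.

```python
import re

ALERT_RE = re.compile(
    r"job '\s*(\d+)' has invalid system queue\s+time "
    r"\(SQ: (\d+) > ST: (\d+)\)"
)


def parse_alert(text):
    """Return (job id, SQ, ST, SQ - ST in seconds), or None if no match."""
    m = ALERT_RE.search(text)
    if m is None:
        return None
    sq, st = int(m.group(2)), int(m.group(3))
    return m.group(1), sq, st, sq - st


if __name__ == "__main__":
    sample = (
        "04/19 14:36:07 ALERT:    job '             33633' has invalid "
        "system queue\ntime (SQ: 1208607298 > ST: 1208567570)"
    )
    print(parse_alert(sample))
```

For job 33633, SQ is 39728 seconds (about 11 hours) later than ST.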

This is a major problem for us as we rely on the 'diagnose -r' output to
compute and publish used and available job slots and CPUs.

We are running :

torque-devel-2.3.0-snap.200801151629.2
maui-client-3.2.6p20-snap.1182974819.9

rebuilt with some non-default hard limits:

Parameter      : Default Setting : Current Setting
MAX_MCLASS     : 16              : 64
MMAX_JOB       : 4096            : 32768
MAX_MJOB_TRACE : 4096            : 32768
MAX_MRES       : 1024            : 8192
MMAX_SRES      : 128             : 1024
MMAX_NODE      : 5129            : 5120

Thanks in advance for any help.

Michel

     *************************************************************
     * Michel Jouvin                 Email : jouvin at lal.in2p3.fr *
     * LAL / CNRS                    Tel : +33 1 64468932        *
     * B.P. 34                       Fax : +33 1 69079404        *
     * 91898 Orsay Cedex                                         *
     * France                                                    *
     *************************************************************



