[Mauiusers] Error: Standing Reservation cannot be created

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu Sep 6 06:57:17 MDT 2007


A workaround has been found.  Apparently our problem is due to a bug
that is still present in the latest Maui snapshot version,
maui-3.2.6p20-snap.1182974819.

After much experimentation it turned out that Maui refuses to
create a Standing Reservation if the nodes in question are
in an "offline" state in Torque.  The workaround is thus:
1) Stop Maui.
2) Clear the offline status with "pbsnodes -c nodelist".
3) Restart Maui with the Standing Reservation configuration in maui.cfg.
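
The three steps could be scripted along the following lines.  This is
only a sketch in Python: the stop and start commands are site-specific
placeholders (the /etc/init.d/maui path is an assumption, not taken from
our setup), and the node names q001 to q084 are assumed from the log
excerpt, which only shows q083 and q084.  By default it just prints the
commands instead of running them.

```python
import subprocess


def workaround_commands(nodes, stop_cmd, start_cmd):
    """Return the three workaround steps as argv lists, in order."""
    return [
        list(stop_cmd),              # 1) stop Maui
        ["pbsnodes", "-c", *nodes],  # 2) clear the offline status in Torque
        list(start_cmd),             # 3) restart Maui, which reads maui.cfg at startup
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Assumed node names: q001 .. q084.
nodes = [f"q{i:03d}" for i in range(1, 85)]

# Hypothetical init script; substitute whatever stops and starts Maui locally.
commands = workaround_commands(
    nodes,
    stop_cmd=["/etc/init.d/maui", "stop"],
    start_cmd=["/etc/init.d/maui", "start"],
)
run(commands)  # dry run: prints the three commands only
```

Calling run(commands, dry_run=False) would execute the steps in order
and stop at the first command that fails.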

The reason we offlined our new nodes is of course that we didn't
want any jobs to start on those nodes when they initially came up.
We hoped to control access to the nodes using Standing Reservations.

IMHO, the fix needed in Maui is to honor Standing Reservations even
for nodes that are in an "offline" or even "down" state.

Maui developers:  Is this a feasible and desirable modification?

Thanks,
Ole

Ole Holm Nielsen wrote:
> We're trying to set up a new Standing Reservation in the Maui maui.cfg
> file so that a set of newly installed nodes is reserved for a small
> group of test users.
> 
> We have an old SR that works perfectly, but the new SR named "switch5"
> cannot be created, as shown in the maui.log:
> 
> ...
> 09/04 14:30:43 INFO:     MNode[083] 'q083' added to regex list
> 09/04 14:30:43 INFO:     MNode[084] 'q084' added to regex list
> 09/04 14:30:43 MSRSetRes(switch5,1,0)
> 09/04 14:30:43 MJobSetCreds(switch5.0,[ALL],[ALL],[ALL])
> 09/04 14:30:43 MSRGetAttributes(switch5,0,Start,Duration)
> 09/04 14:30:43 INFO:     attempting standing reservation of 336 procs in -INFINITY for   INFINITY
> 09/04 14:30:43 MSRSelectNodeList(switch5.0,switch5,DstNL,NodeCount,00:00:00,ReqNL,12)
> 09/04 14:30:43 INFO:     0 feasible tasks found for job switch5.0:0 in partition DEFAULT (1 Needed)
> 09/04 14:30:43 ALERT:    cannot select 336 procs in partition '[ALL]' for SR 'switch5'
> 09/04 14:30:43 MSRSetRes(switch5,1,1)
> 09/04 14:30:43 MJobSetCreds(switch5.1,[ALL],[ALL],[ALL])
> 09/04 14:30:43 MSRGetAttributes(switch5,1,Start,Duration)
> 09/04 14:30:43 INFO:     reservation not required for specified period
> 09/04 14:30:43 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
> ...
> 
> Apparently the 84 nodes (4 CPUs each) are located correctly, but the reason
> for the above ALERT message is incomprehensible!  The net result is that
> the configured SR isn't working, and the new nodes run production jobs
> that shouldn't land on them.  This is a big problem for us :-(
> I looked into the code in src/moab/MJob.c without gaining any understanding
> of the problem (my fault, of course :-).
> 
> Question: Can anyone point to what's wrong with our SRs or with Maui
> itself?
> 
> FYI, we run Torque 2.1.8 and Maui 3.2.6p20.  This is an excerpt from
> our maui.cfg:
> 
> NODESETPOLICY           ONEOF
> NODESETATTRIBUTE        FEATURE
> NODESETDELAY            1
> NODESETLIST             switch1 switch2 switch3 switch4 switch5 infiniband
> NODESETPRIORITYTYPE     BESTFIT
> # Reservation of the nodes p0XX with Infiniband
> SRCFG[infiniband]       HOSTLIST=p0[012][0-9]
> SRCFG[infiniband]       USERLIST=jensj,bligaard,ohnielse,moses,efernand,studt,ibensig,dc
> SRCFG[infiniband]       PERIOD=INFINITY
> SRCFG[infiniband]       NODEFEATURES=infiniband
> # Testing of new nodes q0XX
> SRCFG[switch5]       HOSTLIST=q0[0-9][0-9]
> SRCFG[switch5]       USERLIST=jensj,dulak,ohnielse
> SRCFG[switch5]       PERIOD=INFINITY
> SRCFG[switch5]       NODEFEATURES=switch5
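
As a quick sanity check of the switch5 HOSTLIST pattern against the
processor count in the log, something like the following could be used.
It is a rough sketch only: the node names q001 to q084 and the sample
p0XX names are assumptions, and whole-name matching with Python's re
module merely approximates whatever Maui does internally with the
pattern.

```python
import re

# HOSTLIST value copied from the SRCFG[switch5] line in maui.cfg.
HOSTLIST = "q0[0-9][0-9]"
CPUS_PER_NODE = 4  # "84 nodes (4 CPUs each)"


def matching_hosts(pattern, hostnames):
    """Return the hostnames that the pattern matches in full."""
    # Assumption: whole-name matching; Maui's own regex handling may differ.
    regex = re.compile(pattern)
    return [h for h in hostnames if regex.fullmatch(h)]


# Assumed node names: q001 .. q084 (new nodes) plus a few p0XX Infiniband nodes.
new_nodes = [f"q{i:03d}" for i in range(1, 85)]
other_nodes = ["p001", "p015", "p029"]

selected = matching_hosts(HOSTLIST, new_nodes + other_nodes)
print(len(selected), "nodes,", len(selected) * CPUS_PER_NODE, "procs")
# prints: 84 nodes, 336 procs
```

Under these assumptions the pattern picks out exactly the 84 new nodes,
and 84 x 4 gives the 336 procs that Maui reports in the log, so the
HOSTLIST itself looks fine.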

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

