[Mauiusers] More information on standing reservation problem ...

Richard Walsh rbw at ahpcrc.org
Tue Mar 8 16:54:58 MST 2005


All,

I am still having trouble getting my configured standing reservations to
control which nodes are allocated to jobs submitted to a particular queue.


Here is the standing reservation (SR) section of my maui.config file (it looks OK to me):

SRCFG[srtest]         PERIOD=INFINITY
SRCFG[srtest]         DAYS=ALL
SRCFG[srtest]         TIMELIMIT=4:00:00
SRCFG[srtest]         TASKCOUNT=4 RESOURCES=PROCS:1;MEM:3500
SRCFG[srtest]         ACCOUNTLIST=root,mrobo,sko,shirron,rbw
SRCFG[srtest]         HOSTLIST=node001,node002
SRCFG[srtest]         CLASSLIST=test

SRCFG[srexpr]         PERIOD=INFINITY
SRCFG[srexpr]         DAYS=ALL
SRCFG[srexpr]         TIMELIMIT=30:00
SRCFG[srexpr]         TASKCOUNT=16 RESOURCES=PROCS:1;MEM:3500
SRCFG[srexpr]         HOSTLIST=node003,node004,node005,node006,node007,node008,node009,node010
SRCFG[srexpr]         CLASSLIST=express

SRCFG[srsmem]         PERIOD=INFINITY
SRCFG[srsmem]         DAYS=ALL
SRCFG[srsmem]         TIMELIMIT=16:00:00
SRCFG[srsmem]         TASKCOUNT=96 RESOURCES=PROCS:1;MEM:3500
SRCFG[srsmem]         HOSTLIST=node011,node012,node013,node014,node015,node016,node017,node018,node019,node020,node021,node022,node023,node024,node025,node026,node027,node028,node029,node030,node031,node032,node033,node034,node035,node036,node037,node038,node039,node040,node041,node042,node043,node044,node045,node046,node047,node048,node049,node050,node051,node052,node053,node054,node055,node056,node057,node058
SRCFG[srsmem]         CLASSLIST=parallel,serial

SRCFG[srbmem]         PERIOD=INFINITY
SRCFG[srbmem]         DAYS=ALL
SRCFG[srbmem]         TIMELIMIT=16:00:00
SRCFG[srbmem]         TASKCOUNT=32 RESOURCES=PROCS:1;MEM:7500
SRCFG[srbmem]         HOSTLIST=node059,node060,node061,node062,node063,node064,node065,node066,node067,node068,node069,node070,node071,node072,node073
SRCFG[srbmem]         CLASSLIST=parallel_lm,serial_lm
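To make the capacity arithmetic behind this configuration explicit, here is a minimal Python sketch (not part of Maui) that groups the SRCFG directives by reservation name and compares each TASKCOUNT with what its HOSTLIST can supply. It assumes every directive sits on a single line, that each task uses one processor (RESOURCES=PROCS:1), and that every node has two processors, as the log excerpt below reports for node001 and node002. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: 2 processors per node, taken from "PROCS: 2" in the log below.
PROCS_PER_NODE = 2

# Matches "SRCFG[name]   KEY=VALUE KEY=VALUE ..." on a single line.
LINE_RE = re.compile(r"^SRCFG\[(\w+)\]\s+(.*)$")


def parse_srcfg(text):
    """Collect KEY=VALUE attributes for each SRCFG[...] reservation."""
    srs = defaultdict(dict)
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        name, rest = m.groups()
        for token in rest.split():
            # Split only on the first '=' so values like PROCS:1;MEM:3500 survive.
            key, _, value = token.partition("=")
            srs[name][key] = value
    return dict(srs)


def check_taskcount(srs, procs_per_node=PROCS_PER_NODE):
    """Return {name: (requested, available)} for SRs whose TASKCOUNT exceeds capacity.

    Assumes one processor per task, so a host list can hold
    len(HOSTLIST) * procs_per_node tasks.
    """
    short = {}
    for name, attrs in srs.items():
        if "TASKCOUNT" not in attrs or "HOSTLIST" not in attrs:
            continue
        requested = int(attrs["TASKCOUNT"])
        available = len(attrs["HOSTLIST"].split(",")) * procs_per_node
        if requested > available:
            short[name] = (requested, available)
    return short
```

Run against the configuration above, only srbmem would be flagged: 15 hosts (node059 to node073) at 2 processors each give 30 tasks against the 32 requested.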


I want jobs submitted to 'test' to go to node001,node002; jobs submitted
to 'express' to go to node003-010; jobs to 'parallel/serial' to node011-058;
and jobs to 'parallel_lm/serial_lm' (lm=large memory) to go to node059-074.
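The intended mapping can be written down as a lookup table, which makes it easy to check where a job actually landed. This is a hypothetical Python sketch; the node ranges are copied from the sentence above, and the helper names are illustrative only.

```python
def node_range(first, last):
    """Expand an inclusive nodeNNN range, e.g. node_range(3, 10) -> node003..node010."""
    return {f"node{i:03d}" for i in range(first, last + 1)}


# Intended class -> node mapping, as described in the text above.
INTENDED = {
    "test": node_range(1, 2),
    "express": node_range(3, 10),
    "parallel": node_range(11, 58),
    "serial": node_range(11, 58),
    "parallel_lm": node_range(59, 74),
    "serial_lm": node_range(59, 74),
}


def misplaced(job_class, allocated_nodes):
    """Return the allocated nodes that fall outside the class's intended set."""
    allowed = INTENDED[job_class]
    return sorted(n for n in allocated_nodes if n not in allowed)
```

For example, an express job that runs on node001 and node002 is reported as misplaced on both nodes, which is the symptom described next.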

They don't; they all end up on node001 and node002. I am new to Maui and may
have fundamentally misunderstood the SR concept ;-), which is fine if you can
set me straight. Right now, the log reports success on the initial setup of
my SRs (shown here):

03/08 17:32:54 INFO:     4 feasible tasks found for job srtest.0:0 in partition DEFAULT (1 Needed)
03/08 17:32:54 MJobAllocMNL(srtest.0,MFeasibleList,NodeMap,MOutList,PRIORITY,1110324774)
03/08 17:32:54 INFO:     using specified hostlist for job srtest.0
03/08 17:32:54 INFO:     hostlist node node001x2 added to job srtest.0
03/08 17:32:54 INFO:     hostlist node node002x2 added to job srtest.0
03/08 17:32:54 INFO:     4 requested hostlist tasks allocated for job srtest.0 (0 remain)
03/08 17:32:54 MResCreate(User,ACL,NULL,2,NodeList,1110324774,2140000000,2,0,srtest.0,ResP,'node001 node002',DRes)
03/08 17:32:54 INFO:     unique reservation ID 'srtest.0.0' selected
03/08 17:32:54 MResAllocate(srtest.0.0,NodeList)
03/08 17:32:54 MResAddNode(srtest.0.0,node001,2,0)
03/08 17:32:54 MRECheck(node001,MResAddNode-Start,FORCE)
03/08 17:32:54 MRECheck(node001,MResAddNode-End,FORCE)
03/08 17:32:54 INFO:     N[node001]->RE[000] S srtest.0.0(0)  00:00:00 R: 'PROCS: 2  MEM: 7722M  SWAP: 15G  DISK: 1M'x1
03/08 17:32:54 INFO:     N[node001]->RE[001] E srtest.0.0(0)    INFINITY R: 'PROCS: 2  MEM: 7722M  SWAP: 15G  DISK: 1M'x1
03/08 17:32:54 MResAddNode(srtest.0.0,node002,2,0)
03/08 17:32:54 MRECheck(node002,MResAddNode-Start,FORCE)
03/08 17:32:54 MRECheck(node002,MResAddNode-End,FORCE)
03/08 17:32:54 INFO:     N[node002]->RE[000] S srtest.0.0(0)  00:00:00 R: 'PROCS: 2  MEM: 7722M  SWAP: 15G  DISK: 1M'x1
03/08 17:32:54 INFO:     N[node002]->RE[001] E srtest.0.0(0)    INFINITY R: 'PROCS: 2  MEM: 7722M  SWAP: 15G  DISK: 1M'x1
03/08 17:32:54 INFO:     full SR reserved 4 procs in partition '[ALL]' to start in 00:00:00 at (1

When Maui wakes up 1:30 later, MSRSetRes is called again and
MReqCheckResourceMatch() fails to include any node but the first one in the
HOSTLIST in its list of feasible nodes. For node002, which is clearly part of
the SR specification for the 'srtest' reservation, it reports that the node
is not in the hostlist:

03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node001,NULL)
03/08 17:34:25 INFO:     node in requested hostlist
03/08 17:34:25 MNodeCheckPolicies(srtest.0,node001,2)
03/08 17:34:25 MJobCheckNRes(srtest.0,node001,RQ[0],  INFINITY,TCAvail,1.000,RIndex,NULL,FeasCheck)
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node001,RIndex)
03/08 17:34:25 INFO:     node in requested hostlist
03/08 17:34:25 INFO:     node node001 added to feasible list (2 tasks)
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node002,NULL)
03/08 17:34:25 INFO:     node is not in specified hostlist
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node003,NULL)
03/08 17:34:25 INFO:     node is not in specified hostlist
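When a log like this runs to thousands of lines, a small filter helps show which hosts Maui accepts or rejects. The following Python sketch is hypothetical and assumes the log lines look exactly like the excerpt above: each MReqCheckResourceMatch() call is followed by an INFO line saying whether the node is in the hostlist. Other lines are ignored.

```python
import re

# Captures the node name from e.g. "MReqCheckResourceMatch(srtest.0,0,node002,NULL)".
CALL_RE = re.compile(r"MReqCheckResourceMatch\([^,]+,\d+,([^,]+),")


def hostlist_verdicts(log_text):
    """Map each node checked by MReqCheckResourceMatch to 'in' or 'not in'."""
    verdicts = {}
    current = None  # node named by the most recent MReqCheckResourceMatch call
    for line in log_text.splitlines():
        m = CALL_RE.search(line)
        if m:
            current = m.group(1)
            continue
        if current is None:
            continue
        if "node is not in specified hostlist" in line:
            verdicts[current] = "not in"
            current = None
        elif "node in requested hostlist" in line:
            verdicts[current] = "in"
            current = None
    return verdicts
```

Fed the excerpt above, it reports node001 as in the hostlist and node002 and node003 as not in it.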

Here is the output from diagnose -r, which looks OK, except that one node is
down (node074, which I have left out of the host list). This is why srbmem
asks for 32 tasks but gets only 30.

ResID                      Type Par   StartTime     EndTime     Duration Node Task Proc
-----                      ---- ---   ---------     -------     -------- ---- ---- ----
srtest.0.0                 User DEF    00:00:00    INFINITY     INFINITY    2    2    4
    Flags: STANDINGRES
    ACL: RES==srtest.0= ACCT==root+:==mrobo+:==sko+:==shirron+:==rbw+ CLASS==test+ DURATION<=4:00:00+
    CL:  RES==srtest.0
    Task Resources: PROCS: 1  MEM: 3500M
    Attributes (HostList='node001 node002')
    Active PH: -0.00/0.00 (0.00%)
    SRAttributes (TaskCount: 4  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)
srexpr.0.0                 User DEF    00:00:00    INFINITY     INFINITY    8    8   16
    Flags: STANDINGRES
    ACL: RES==srexpr.0= CLASS==express+ DURATION<=1:06:00:00+
    CL:  RES==srexpr.0
    Task Resources: PROCS: 1  MEM: 3500M
    Attributes (HostList='node003 node004 node005 node006 node007 node008 node009 node010')
    Active PH: -0.00/0.00 (0.00%)
    SRAttributes (TaskCount: 16  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)
srsmem.0.0                 User DEF    00:00:00    INFINITY     INFINITY   48   48   96
    Flags: STANDINGRES
    ACL: RES==srsmem.0= CLASS==parallel+:==serial+ DURATION<=16:00:00+
    CL:  RES==srsmem.0
    Task Resources: PROCS: 1  MEM: 3500M
    Attributes (HostList='node011 node012 node013 node014 node015 node016 node017 node018 node019 node020 node021 node022 node023 node024 node025 node026 node027 node028 node029 node030 node031 node032 node033 node034 node035 node036 node037 node038 node039 node040 node041 node042 node043 node044 node045 node046 node047 node048 node049 node050 node051 node052 node053 node054 node055 node056 node057 node058')
    Active PH: -0.01/0.01 (0.00%)
    SRAttributes (TaskCount: 96  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)
srbmem.0.0                 User DEF    00:00:00    INFINITY     INFINITY   15   15   30
    Flags: STANDINGRES
    ACL: RES==srbmem.0= CLASS==parallel_lm+:==serial_lm+ DURATION<=16:00:00+
    CL:  RES==srbmem.0
    Task Resources: PROCS: 1  MEM: 7500M
    Attributes (HostList='node059 node060 node061 node062 node063 node064 node065 node066 node067 node068 node069 node070 node071 node072 node073')
    Active PH: -0.00/0.00 (0.00%)
    SRAttributes (TaskCount: 32  StartTime: 00:00:00  EndTime: 1:00:00:00  Days: ALL)

Ideas? I need to get this working correctly before we put this system into
production. I am running maui-3.2.6p11 and torque-1.2.0p0.

I will send any other information needed.

Regards,


Richard Walsh
Army High Performance Computing and Research Center
