[Mauiusers] More information on standing reservation problem ...
Richard Walsh
rbw at ahpcrc.org
Tue Mar 8 16:54:58 MST 2005
All,
Still having a problem getting my configured standing reservations to
control nodes allocated to the jobs submitted to a particular queue.
Here is the SR piece of my maui.config file (looks ok to me):
SRCFG[srtest] PERIOD=INFINITY
SRCFG[srtest] DAYS=ALL
SRCFG[srtest] TIMELIMIT=4:00:00
SRCFG[srtest] TASKCOUNT=4 RESOURCES=PROCS:1;MEM:3500
SRCFG[srtest] ACCOUNTLIST=root,mrobo,sko,shirron,rbw
SRCFG[srtest] HOSTLIST=node001,node002
SRCFG[srtest] CLASSLIST=test
SRCFG[srexpr] PERIOD=INFINITY
SRCFG[srexpr] DAYS=ALL
SRCFG[srexpr] TIMELIMIT=30:00
SRCFG[srexpr] TASKCOUNT=16 RESOURCES=PROCS:1;MEM:3500
SRCFG[srexpr] HOSTLIST=node003,node004,node005,node006,node007,node008,node009,node010
SRCFG[srexpr] CLASSLIST=express
SRCFG[srsmem] PERIOD=INFINITY
SRCFG[srsmem] DAYS=ALL
SRCFG[srsmem] TIMELIMIT=16:00:00
SRCFG[srsmem] TASKCOUNT=96 RESOURCES=PROCS:1;MEM:3500
SRCFG[srsmem] HOSTLIST=node011,node012,node013,node014,node015,node016,node017,node018,node019,node020,node021,node022,node023,node024,node025,node026,node027,node028,node029,node030,node031,node032,node033,node034,node035,node036,node037,node038,node039,node040,node041,node042,node043,node044,node045,node046,node047,node048,node049,node050,node051,node052,node053,node054,node055,node056,node057,node058
SRCFG[srsmem] CLASSLIST=parallel,serial
SRCFG[srbmem] PERIOD=INFINITY
SRCFG[srbmem] DAYS=ALL
SRCFG[srbmem] TIMELIMIT=16:00:00
SRCFG[srbmem] TASKCOUNT=32 RESOURCES=PROCS:1;MEM:7500
SRCFG[srbmem] HOSTLIST=node059,node060,node061,node062,node063,node064,node065,node066,node067,node068,node069,node070,node071,node072,node073
SRCFG[srbmem] CLASSLIST=parallel_lm,serial_lm
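As a sanity check on the configuration above, the class-to-node mapping that the SRCFG lines encode can be tabulated with a short script. This is only a sketch: the function names parse_srcfg and class_to_hosts are made up for illustration, the parser assumes one or more whitespace-separated KEY=VALUE pairs per SRCFG line, and only the first two reservations are copied in to keep it short. It says nothing about how Maui itself reads the file.

```python
import re
from collections import defaultdict

# Two of the reservations from the config above (hostlists on one line).
CONFIG = """\
SRCFG[srtest] TASKCOUNT=4 RESOURCES=PROCS:1;MEM:3500
SRCFG[srtest] HOSTLIST=node001,node002
SRCFG[srtest] CLASSLIST=test
SRCFG[srexpr] TASKCOUNT=16 RESOURCES=PROCS:1;MEM:3500
SRCFG[srexpr] HOSTLIST=node003,node004,node005,node006,node007,node008,node009,node010
SRCFG[srexpr] CLASSLIST=express
"""

SRCFG_LINE = re.compile(r"^SRCFG\[(\w+)\]\s+(.*)$")


def parse_srcfg(text):
    """Collect {reservation: {KEY: value}} from SRCFG[name] KEY=VALUE lines."""
    reservations = defaultdict(dict)
    for line in text.splitlines():
        match = SRCFG_LINE.match(line.strip())
        if not match:
            continue  # ignore anything that is not an SRCFG line
        name, rest = match.groups()
        for pair in rest.split():
            key, _, value = pair.partition("=")
            reservations[name][key] = value
    return dict(reservations)


def class_to_hosts(reservations):
    """Map each class in a CLASSLIST to the HOSTLIST of its reservation."""
    mapping = {}
    for attrs in reservations.values():
        hosts = [h for h in attrs.get("HOSTLIST", "").split(",") if h]
        for cls in attrs.get("CLASSLIST", "").split(","):
            if cls:
                mapping[cls] = hosts
    return mapping


if __name__ == "__main__":
    for cls, hosts in class_to_hosts(parse_srcfg(CONFIG)).items():
        print(f"{cls}: {hosts[0]} .. {hosts[-1]} ({len(hosts)} nodes)")
```

Running the same functions over the full file gives a quick table to compare against where jobs actually land.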
I want jobs submitted to 'test' to go to node001,node002; jobs submitted
to 'express' to go to node003-010; jobs to 'parallel/serial' to node011-058;
and jobs to 'parallel_lm/serial_lm' (lm=large memory) to go to node059-074.
They don't ... they end up on node001,node002. I am new to maui and perhaps
have fundamentally misunderstood the SR concept ... ;-) ... that's OK if you
can set me straight. Right now, the log first reports success on the initial
setup of my SRs (shown here):
03/08 17:32:54 INFO: 4 feasible tasks found for job srtest.0:0 in partition DEFAULT (1 Needed)
03/08 17:32:54 MJobAllocMNL(srtest.0,MFeasibleList,NodeMap,MOutList,PRIORITY,1110324774)
03/08 17:32:54 INFO: using specified hostlist for job srtest.0
03/08 17:32:54 INFO: hostlist node node001x2 added to job srtest.0
03/08 17:32:54 INFO: hostlist node node002x2 added to job srtest.0
03/08 17:32:54 INFO: 4 requested hostlist tasks allocated for job srtest.0 (0 remain)
03/08 17:32:54 MResCreate(User,ACL,NULL,2,NodeList,1110324774,2140000000,2,0,srtest.0,ResP,'node001 node002',DRes)
03/08 17:32:54 INFO: unique reservation ID 'srtest.0.0' selected
03/08 17:32:54 MResAllocate(srtest.0.0,NodeList)
03/08 17:32:54 MResAddNode(srtest.0.0,node001,2,0)
03/08 17:32:54 MRECheck(node001,MResAddNode-Start,FORCE)
03/08 17:32:54 MRECheck(node001,MResAddNode-End,FORCE)
03/08 17:32:54 INFO: N[node001]->RE[000] S srtest.0.0(0) 00:00:00 R: 'PROCS: 2 MEM: 7722M SWAP: 15G DISK: 1M'x1
03/08 17:32:54 INFO: N[node001]->RE[001] E srtest.0.0(0) INFINITY R: 'PROCS: 2 MEM: 7722M SWAP: 15G DISK: 1M'x1
03/08 17:32:54 MResAddNode(srtest.0.0,node002,2,0)
03/08 17:32:54 MRECheck(node002,MResAddNode-Start,FORCE)
03/08 17:32:54 MRECheck(node002,MResAddNode-End,FORCE)
03/08 17:32:54 INFO: N[node002]->RE[000] S srtest.0.0(0) 00:00:00 R: 'PROCS: 2 MEM: 7722M SWAP: 15G DISK: 1M'x1
03/08 17:32:54 INFO: N[node002]->RE[001] E srtest.0.0(0) INFINITY R: 'PROCS: 2 MEM: 7722M SWAP: 15G DISK: 1M'x1
03/08 17:32:54 INFO: full SR reserved 4 procs in partition '[ALL]' to start in 00:00:00 at (1
When it wakes up 1:30 later, MSRSetRes is called again and
MReqCheckResourceMatch() fails to include any but the first node in the
HOSTLIST in its list of 'feasible nodes'. For node002, which is clearly part
of the SR specification for the 'srtest' reservation, it indicates that
node002 is not in the HOSTLIST:
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node001,NULL)
03/08 17:34:25 INFO: node in requested hostlist
03/08 17:34:25 MNodeCheckPolicies(srtest.0,node001,2)
03/08 17:34:25 MJobCheckNRes(srtest.0,node001,RQ[0],INFINITY,TCAvail,1.000,RIndex,NULL,FeasCheck)
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node001,RIndex)
03/08 17:34:25 INFO: node in requested hostlist
03/08 17:34:25 INFO: node node001 added to feasible list (2 tasks)
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node002,NULL)
03/08 17:34:25 INFO: node is not in specified hostlist
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node003,NULL)
03/08 17:34:25 INFO: node is not in specified hostlist
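To see quickly which configured nodes are being rejected, the log can be scanned for each MReqCheckResourceMatch call and the hostlist verdict that follows it. The sketch below is an illustrative helper, not part of Maui: hostlist_verdicts is a made-up name, the two message strings are copied from the excerpt above, and it assumes the verdict line comes after the call line it refers to.

```python
import re

# Lines copied from the log excerpt above.
LOG = """\
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node001,NULL)
03/08 17:34:25 INFO: node in requested hostlist
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node002,NULL)
03/08 17:34:25 INFO: node is not in specified hostlist
03/08 17:34:25 MReqCheckResourceMatch(srtest.0,0,node003,NULL)
03/08 17:34:25 INFO: node is not in specified hostlist
"""

CALL = re.compile(r"MReqCheckResourceMatch\(([^,]+),\d+,([^,]+),")


def hostlist_verdicts(log_text):
    """Return {(job, node): True/False} for the hostlist check of each call."""
    verdicts = {}
    current = None  # (job, node) of the most recent call without a verdict
    for line in log_text.splitlines():
        call = CALL.search(line)
        if call:
            current = (call.group(1), call.group(2))
            continue
        if current is None:
            continue
        if "node is not in specified hostlist" in line:
            verdicts[current] = False
            current = None
        elif "node in requested hostlist" in line:
            verdicts[current] = True
            current = None
    return verdicts


if __name__ == "__main__":
    configured = {"node001", "node002"}  # HOSTLIST of srtest from the config
    for (job, node), accepted in hostlist_verdicts(LOG).items():
        if node in configured and not accepted:
            print(f"{job}: {node} is in HOSTLIST but was rejected")
```

On the excerpt above this flags node002 for srtest.0, which is the mismatch described.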
Here is the output from diagnose -r, which looks OK, except I have one
node down (node074, which has been left out of the host list). This is why
I am asking for 32 tasks but get only 30.
ResID       Type  Par  StartTime  EndTime   Duration  Node  Task  Proc
-----       ----  ---  ---------  -------   --------  ----  ----  ----
srtest.0.0  User  DEF  00:00:00   INFINITY  INFINITY  2     2     4
    Flags: STANDINGRES
    ACL: RES==srtest.0= ACCT==root+:==mrobo+:==sko+:==shirron+:==rbw+ CLASS==test+ DURATION<=4:00:00+
    CL: RES==srtest.0
    Task Resources: PROCS: 1 MEM: 3500M
    Attributes (HostList='node001 node002')
    Active PH: -0.00/0.00 (0.00%)
    SRAttributes (TaskCount: 4 StartTime: 00:00:00 EndTime: 1:00:00:00 Days: ALL)
srexpr.0.0  User  DEF  00:00:00   INFINITY  INFINITY  8     8     16
    Flags: STANDINGRES
    ACL: RES==srexpr.0= CLASS==express+ DURATION<=1:06:00:00+
    CL: RES==srexpr.0
    Task Resources: PROCS: 1 MEM: 3500M
    Attributes (HostList='node003 node004 node005 node006 node007 node008 node009 node010')
    Active PH: -0.00/0.00 (0.00%)
    SRAttributes (TaskCount: 16 StartTime: 00:00:00 EndTime: 1:00:00:00 Days: ALL)
srsmem.0.0  User  DEF  00:00:00   INFINITY  INFINITY  48    48    96
    Flags: STANDINGRES
    ACL: RES==srsmem.0= CLASS==parallel+:==serial+ DURATION<=16:00:00+
    CL: RES==srsmem.0
    Task Resources: PROCS: 1 MEM: 3500M
    Attributes (HostList='node011 node012 node013 node014 node015 node016 node017 node018 node019 node020 node021 node022 node023 node024 node025 node026 node027 node028 node029 node030 node031 node032 node033 node034 node035 node036 node037 node038 node039 node040 node041 node042 node043 node044 node045 node046 node047 node048 node049 node050 node051 node052 node053 node054 node055 node056 node057 node058')
    Active PH: -0.01/0.01 (0.00%)
    SRAttributes (TaskCount: 96 StartTime: 00:00:00 EndTime: 1:00:00:00 Days: ALL)
srbmem.0.0  User  DEF  00:00:00   INFINITY  INFINITY  15    15    30
    Flags: STANDINGRES
    ACL: RES==srbmem.0= CLASS==parallel_lm+:==serial_lm+ DURATION<=16:00:00+
    CL: RES==srbmem.0
    Task Resources: PROCS: 1 MEM: 7500M
    Attributes (HostList='node059 node060 node061 node062 node063 node064 node065 node066 node067 node068 node069 node070 node071 node072 node073')
    Active PH: -0.00/0.00 (0.00%)
    SRAttributes (TaskCount: 32 StartTime: 00:00:00 EndTime: 1:00:00:00 Days: ALL)
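The 30-versus-32 count for srbmem can be reproduced with simple arithmetic. This sketch assumes every node in the hostlist has two processors (the log only shows 'PROCS: 2' for node001 and node002, so that is an assumption for the large-memory nodes) and that each task needs one processor, per RESOURCES=PROCS:1. Memory limits are ignored, and the function name is made up for illustration.

```python
PROCS_PER_NODE = 2   # assumption: same as node001/node002 in the log
PROCS_PER_TASK = 1   # from RESOURCES=PROCS:1


def proc_limited_tasks(hosts, procs_per_node=PROCS_PER_NODE,
                       procs_per_task=PROCS_PER_TASK):
    """Upper bound on tasks the hostlist can hold, counting processors only."""
    return len(hosts) * (procs_per_node // procs_per_task)


# srbmem HOSTLIST as configured: node059 .. node073 (node074 is down).
srbmem_hosts = [f"node{n:03d}" for n in range(59, 74)]
requested = 32  # TASKCOUNT for srbmem

available = proc_limited_tasks(srbmem_hosts)
print(f"{len(srbmem_hosts)} nodes -> {available} tasks, "
      f"{requested - available} short of TASKCOUNT={requested}")
```

With node074 back in the list the same calculation gives 16 nodes and 32 tasks.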
Ideas? I need to get this to work correctly before we put this system
into production. I am running maui-3.2.6p11 and torque-1.2.0p0.
Will send any other information needed.
Regards,
Richard Walsh
Army High Performance Computing and Research Center