[Mauiusers] Reservation not behaving as expected

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Mon Sep 26 07:23:26 MDT 2011


Hello,

I could use some help figuring out why a reservation prevented jobs from running when they should have run.  I'm using torque 2.3.6 and maui 3.2.6p21.  Our cluster is an SGI ICE system, so the node names look like rXiYnZZ, where X is the rack number, Y is the IRU number (0-3) and ZZ is the node number within the IRU (0-15).  We have 2 fully populated racks, so 8 IRUs and 128 nodes in total.  I wanted to give a few users dedicated time on 75% of the machine for 7 hours, which I did with the following command:


service0:~ # setres -n DAC -s 11:00_09/25  -d 0:07:00:00 -u lumpkin:lebeau:kboyles:bstewart 'r1i[0-3]n[0-9]|r2i[0-1]n[0-9]'

reservation created


reservation 'testing.0' created on 96 nodes (1152 tasks)
r1i0n0:1
r1i0n1:1
r1i0n2:1
r1i0n3:1
r1i0n4:1
<clip>



All seemed well at this point.  The DAC reservation should have reserved 96 nodes, leaving 32 for other users.  However, when the reservation took effect yesterday, several jobs that should have run did not.  There were no other reservations or policies in effect that should have prevented jobs from running.  We aren't using any QOS either; it's pretty much a FIFO queue with some soft and hard limits on the number of jobs and number of procs for each user.
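
The 96/32 split can be checked against the host expression with a short Python sketch. It assumes the expression is matched only from the start of the node name (so "n[0-9]" also catches n10 through n15), which is consistent with the 96 nodes reported above but is not confirmed Maui behaviour. The node list is generated from the rXiYnZZ naming scheme rather than read from the cluster.

```python
import re

# Node names follow rXiYnZZ: 2 racks, 4 IRUs per rack, 16 nodes per IRU.
nodes = [f"r{r}i{i}n{n}" for r in (1, 2) for i in range(4) for n in range(16)]

# Host expression copied from the setres command.
host_expr = re.compile(r"r1i[0-3]n[0-9]|r2i[0-1]n[0-9]")

# Assumption: prefix matching, so "r1i0n1" also matches r1i0n10..r1i0n15.
reserved = [name for name in nodes if host_expr.match(name)]

# For comparison: what a fully anchored match would select.
anchored = [name for name in nodes if host_expr.fullmatch(name)]

print("total nodes:      ", len(nodes))
print("reserved (prefix):", len(reserved))
print("left over:        ", len(nodes) - len(reserved))
print("reserved (exact): ", len(anchored))
```

With prefix matching this gives 96 reserved nodes out of 128; an exact match would select only 60, so the reported node count is worth confirming whenever an expression like this is used.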

All of the commands below were run just after the DAC reservation took effect.  The first job in the queue (83219) should have run - there should have been 32 nodes free.  For some reason maui did not run it.  Failing that, the 8-node jobs should have run.  Looking at the "checkjob -v 83224" output, maui thinks that essentially all the nodes were reserved (except for the 8 nodes in use by 83223).
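
To make the long per-node listing in the checkjob output easier to digest, a small Python sketch like the following can tally nodes by rejection reason. The sample text is a handful of lines copied from the listing below; the function name and regular expression are illustrative, not part of any Maui tooling.

```python
import re
from collections import Counter

# A few lines copied from the checkjob listing; paste the full output in practice.
sample = """\
r2i2n6                   rejected : ReserveTime
r2i2n7                   rejected : ReserveTime
r2i2n8                   rejected : State
r2i2n9                   rejected : State
r2i3n0                   rejected : ReserveTime
"""

# Matches lines of the form "<node>   rejected : <reason>".
LINE = re.compile(r"^(r\d+i\d+n\d+)\s+rejected\s*:\s*(\w+)\s*$")

def tally_rejections(text):
    """Count nodes per rejection reason and group node names by reason."""
    counts = Counter()
    by_reason = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            node, reason = m.groups()
            counts[reason] += 1
            by_reason.setdefault(reason, []).append(node)
    return counts, by_reason

counts, by_reason = tally_rejections(sample)
print(dict(counts))
print(by_reason["State"])
```

Run over the full listing, the totals should agree with the "Rejection Reasons" summary line, and the grouped names show which nodes fall under each reason.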

Any idea what might be going on here?

Thanks,
Darby





service0:~ # qstat -a

                                                              Req'd   Elap
Job ID               Username Queue    Jobname          NDS   Time  S Time
-------------------- -------- -------- ---------------- ----- ----- - -----
83219                aschwing huge     m0.40a0.00_SAES     32 04:00 Q   -- 
83223                stuart   medium   m0.27a30.0b20.0      8 04:00 R 01:55
83224                stuart   medium   m0.27a0.0b20.0       8 04:00 Q   -- 
83225                stuart   medium   m0.27-30.0b20.0      8 04:00 Q   -- 
service0:~ # checkjob -v 83224


checking job 83224 (RM job '83224.service0')

State: Idle
Creds:  user:stuart  group:eg3  class:medium  qos:DEFAULT
WallTime: 00:00:00 of 4:00:00
SubmitTime: Sun Sep 25 10:51:08
  (Time Queued  Total: 00:17:40  Eligible: 00:17:32)

Total Tasks: 96

Req[0]  TaskCount: 96  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SINGLEUSER
TasksPerNode: 12  NodeCount: 8


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
SystemQueueTime: Sun Sep 25 10:51:16

PE:  96.00  StartPriority:  17
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 96 procs found)
idle procs: 1440  feasible procs:   0

Rejection Reasons: [State        :    8][ReserveTime  :  120]

Detailed Node Availability Information:

r1i0n0                   rejected : ReserveTime
r1i0n1                   rejected : ReserveTime
r1i0n2                   rejected : ReserveTime
r1i0n3                   rejected : ReserveTime
r1i0n4                   rejected : ReserveTime
r1i0n5                   rejected : ReserveTime
r1i0n6                   rejected : ReserveTime
r1i0n7                   rejected : ReserveTime
r1i0n8                   rejected : ReserveTime
r1i0n9                   rejected : ReserveTime
r1i0n10                  rejected : ReserveTime
r1i0n11                  rejected : ReserveTime
r1i0n12                  rejected : ReserveTime
r1i0n13                  rejected : ReserveTime
r1i0n14                  rejected : ReserveTime
r1i0n15                  rejected : ReserveTime
r1i1n0                   rejected : ReserveTime
r1i1n1                   rejected : ReserveTime
r1i1n2                   rejected : ReserveTime
r1i1n3                   rejected : ReserveTime
r1i1n4                   rejected : ReserveTime
r1i1n5                   rejected : ReserveTime
r1i1n6                   rejected : ReserveTime
r1i1n7                   rejected : ReserveTime
r1i1n8                   rejected : ReserveTime
r1i1n9                   rejected : ReserveTime
r1i1n10                  rejected : ReserveTime
r1i1n11                  rejected : ReserveTime
r1i1n12                  rejected : ReserveTime
r1i1n13                  rejected : ReserveTime
r1i1n14                  rejected : ReserveTime
r1i1n15                  rejected : ReserveTime
r1i2n0                   rejected : ReserveTime
r1i2n1                   rejected : ReserveTime
r1i2n2                   rejected : ReserveTime
r1i2n3                   rejected : ReserveTime
r1i2n4                   rejected : ReserveTime
r1i2n5                   rejected : ReserveTime
r1i2n6                   rejected : ReserveTime
r1i2n7                   rejected : ReserveTime
r1i2n8                   rejected : ReserveTime
r1i2n9                   rejected : ReserveTime
r1i2n10                  rejected : ReserveTime
r1i2n11                  rejected : ReserveTime
r1i2n12                  rejected : ReserveTime
r1i2n13                  rejected : ReserveTime
r1i2n14                  rejected : ReserveTime
r1i2n15                  rejected : ReserveTime
r1i3n0                   rejected : ReserveTime
r1i3n1                   rejected : ReserveTime
r1i3n2                   rejected : ReserveTime
r1i3n3                   rejected : ReserveTime
r1i3n4                   rejected : ReserveTime
r1i3n5                   rejected : ReserveTime
r1i3n6                   rejected : ReserveTime
r1i3n7                   rejected : ReserveTime
r1i3n8                   rejected : ReserveTime
r1i3n9                   rejected : ReserveTime
r1i3n10                  rejected : ReserveTime
r1i3n11                  rejected : ReserveTime
r1i3n12                  rejected : ReserveTime
r1i3n13                  rejected : ReserveTime
r1i3n14                  rejected : ReserveTime
r1i3n15                  rejected : ReserveTime
r2i0n0                   rejected : ReserveTime
r2i0n1                   rejected : ReserveTime
r2i0n2                   rejected : ReserveTime
r2i0n3                   rejected : ReserveTime
r2i0n4                   rejected : ReserveTime
r2i0n5                   rejected : ReserveTime
r2i0n6                   rejected : ReserveTime
r2i0n7                   rejected : ReserveTime
r2i0n8                   rejected : ReserveTime
r2i0n9                   rejected : ReserveTime
r2i0n10                  rejected : ReserveTime
r2i0n11                  rejected : ReserveTime
r2i0n12                  rejected : ReserveTime
r2i0n13                  rejected : ReserveTime
r2i0n14                  rejected : ReserveTime
r2i0n15                  rejected : ReserveTime
r2i1n0                   rejected : ReserveTime
r2i1n1                   rejected : ReserveTime
r2i1n2                   rejected : ReserveTime
r2i1n3                   rejected : ReserveTime
r2i1n4                   rejected : ReserveTime
r2i1n5                   rejected : ReserveTime
r2i1n6                   rejected : ReserveTime
r2i1n7                   rejected : ReserveTime
r2i1n8                   rejected : ReserveTime
r2i1n9                   rejected : ReserveTime
r2i1n10                  rejected : ReserveTime
r2i1n11                  rejected : ReserveTime
r2i1n12                  rejected : ReserveTime
r2i1n13                  rejected : ReserveTime
r2i1n14                  rejected : ReserveTime
r2i1n15                  rejected : ReserveTime
r2i2n0                   rejected : ReserveTime
r2i2n1                   rejected : ReserveTime
r2i2n2                   rejected : ReserveTime
r2i2n3                   rejected : ReserveTime
r2i2n4                   rejected : ReserveTime
r2i2n5                   rejected : ReserveTime
r2i2n6                   rejected : ReserveTime
r2i2n7                   rejected : ReserveTime
r2i2n8                   rejected : State
r2i2n9                   rejected : State
r2i2n10                  rejected : State
r2i2n11                  rejected : State
r2i2n12                  rejected : State
r2i2n13                  rejected : State
r2i2n14                  rejected : State
r2i2n15                  rejected : State
r2i3n0                   rejected : ReserveTime
r2i3n1                   rejected : ReserveTime
r2i3n2                   rejected : ReserveTime
r2i3n3                   rejected : ReserveTime
r2i3n4                   rejected : ReserveTime
r2i3n5                   rejected : ReserveTime
r2i3n6                   rejected : ReserveTime
r2i3n7                   rejected : ReserveTime
r2i3n8                   rejected : ReserveTime
r2i3n9                   rejected : ReserveTime
r2i3n10                  rejected : ReserveTime
r2i3n11                  rejected : ReserveTime
r2i3n12                  rejected : ReserveTime
r2i3n13                  rejected : ReserveTime
r2i3n14                  rejected : ReserveTime
r2i3n15                  rejected : ReserveTime

service0:~ # checknode r2i3n0


checking node r2i3n0

State:      Idle  (in current state for 00:01:02)
Configured Resources: PROCS: 12  MEM: 23G  SWAP: 23G  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [ginormous 12:12][debug 12:12][large 12:12][huge 12:12][medium 12:12][route 12:12][small 12:12][super 12:12][tiny 12:12]

Total Time:   INFINITY  Up:   INFINITY (99.93%)  Active:   INFINITY (82.06%)

Reservations:
  Job '83219'(x12)  2:03:58 -> 6:03:58 (4:00:00)

service0:~ # checknode r1i0n0


checking node r1i0n0

State:      Idle  (in current state for 00:01:33)
Configured Resources: PROCS: 12  MEM: 23G  SWAP: 23G  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [ginormous 12:12][debug 12:12][large 12:12][huge 12:12][medium 12:12][route 12:12][small 12:12][super 12:12][tiny 12:12]

Total Time:   INFINITY  Up:   INFINITY (99.79%)  Active: 77:16:44:11 (19.63%)

Reservations:
  User 'DAC.0'(x1)  -00:09:50 -> 6:50:10 (7:00:00)
    Blocked Resources at -00:09:50   Procs: 12/12 (100.00%)

service0:~ # diagnose -r
Diagnosing Reservations
ResID                      Type Par   StartTime     EndTime     Duration Node Task Proc
-----                      ---- ---   ---------     -------     -------- ---- ---- ----
DAC.0                      User DEF   -00:10:21     6:49:39      7:00:00   96   96 1152
    Flags: PREEMPTEE
    ACL: RES==DAC.0= USER==lumpkin+:==lebeau+:==kboyles+:==bstewart+ 
    CL:  RES==DAC.0 
    Task Resources: PROCS: [ALL]
    Attributes (HostList='r1i[0-3]n[0-9]|r2i[0-1]n[0-9]')
    Active PH: 0.00/202.16 (0.00%)
83223                       Job DEF    -1:57:04     2:02:56      4:00:00    8   96   96
    ACL: JOB==83223= 
    CL:  JOB==83223 USER==stuart GROUP==eg3 CLASS==medium QOS==DEFAULT DURATION==4:00:00 PROC==96 
debug.1.0                  User DEF    20:49:39  1:05:49:39      9:00:00    8    8   96
    Flags: STANDINGRES SHARED
    ACL: RES==debug.1= CLASS==debug+ 
    CL:  RES==debug.1 
    Task Resources: PROCS: [ALL]
    Attributes (HostList='r2i3n8 r2i3n9 r2i3n10 r2i3n11 r2i3n12 r2i3n13 r2i3n14 r2i3n15')
    SRAttributes (TaskCount: 8  StartTime: 8:00:00  EndTime: 17:00:00  Days: Mon,Tue,Wed,Thu,Fri)
83219                       Job DEF     2:02:56     6:02:56      4:00:00   32  384  384
    Flags: PREEMPTEE
    ACL: JOB==83219= 
    CL:  JOB==83219 USER==aschwing GROUP==eg3 CLASS==huge QOS==DEFAULT DURATION==4:00:00 PROC==384 
    Attributes (Priority=56)

Active Reserved Processors: 96

service0:~ #
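
As a sanity check on the figures in the diagnose -r output above, the following Python sketch does the node and time arithmetic. It uses only the reported values and makes two assumptions that are not confirmed here: that the 8 nodes running 83223 lie outside the DAC host list (consistent with r2i2n8-15 being rejected for State), and that a job is not started on nodes where its wall time would overlap a later reservation. It is one possible reading of the output, not a statement of how Maui actually decides.

```python
from datetime import timedelta

def hms(text):
    """Parse an H:MM:SS string into a timedelta."""
    h, m, s = (int(part) for part in text.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s)

total_nodes = 2 * 4 * 16      # 2 racks x 4 IRUs x 16 nodes
dac_nodes = 96                # DAC.0 node count
running_nodes = 8             # nodes held by running job 83223 (assumed outside DAC)
idle_outside_dac = total_nodes - dac_nodes - running_nodes

job_83219_nodes = 32          # node count of the reservation for 83219
window = hms("2:02:56")       # time until the reservation for 83219 starts
walltime_83224 = hms("4:00:00")

print("idle nodes outside DAC:", idle_outside_dac)
print("enough for 83219 now?  ", idle_outside_dac >= job_83219_nodes)
print("83224 fits the window? ", walltime_83224 <= window)
```

Under those assumptions, 24 nodes are idle outside DAC, which is fewer than the 32 that 83219 needs, and the 4:00:00 wall time of 83224 is longer than the 2:02:56 gap before the reservation for 83219 begins.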

