[Mauiusers] Log analysis/Slipping Reservations
Ansgar Esztermann
aeszter at gwdg.de
Tue May 5 08:01:35 MDT 2009
I have noticed that low-priority, backfilled jobs sometimes seem to
prevent high-priority jobs from starting. Typically, things look like
this:
05/05 09:51:27 INFO: job '99468' reserved 64 tasks (partition IBA)
to start in 00:19:47 on Tue May 5 10:11:12
05/05 09:51:40 INFO: job '99468' reserved 64 tasks (partition IBA)
to start in 1:00:00 on Tue May 5 10:51:38
Job 99468 has top priority, so it already has a reservation in place.
But somehow, this reservation tends to slip from time to time.
I have turned up the LOGLEVEL and LOGFILEROLLDEPTH parameters, but the
reason for this behaviour still eludes me.
This is the (hopefully) relevant part of the logfile in all its glory.
I have interspersed comments with my interpretation of what happens:
05/05 09:51:40 INFO: attempting to get resources for 99468 8 * (P:
1 M: 0 S: 0 D: 0)
05/05 09:51:40 MRECheck(node023,MJobGetSNRange-Start,FORCE)
05/05 09:51:40 INFO: N[node023]->RE[000] S 88801(2) -00:06:18 R:
'PROCS: 1'x2
05/05 09:51:40 INFO: N[node023]->RE[001] S 88802(3) -00:06:18 R:
'PROCS: 1'x2
05/05 09:51:40 INFO: N[node023]->RE[002] S 88807(4) -00:05:53 R:
'PROCS: 1'x2
05/05 09:51:40 INFO: N[node023]->RE[003] E 88801(2) 00:13:42 R:
'PROCS: 1'x2
05/05 09:51:40 INFO: N[node023]->RE[004] E 88802(3) 00:13:42 R:
'PROCS: 1'x2
05/05 09:51:40 INFO: N[node023]->RE[005] E 88807(4) 00:14:07 R:
'PROCS: 1'x2
05/05 09:51:40 INFO: N[node023]->RE[006] S SYSTEM.0(0)
23:05:08:22 R: 'PROCS: 8'x1
05/05 09:51:40 INFO: N[node023]->RE[007] E SYSTEM.0(0)
23:13:08:22 R: 'PROCS: 8'x1
Node023 will be unavailable at 10:11:12, leading to the delay (see
below). Right now, everything looks fine: there are three reservations
ending well before 10:11, plus one that starts in 23 days. Since job
99468 only requires 7 days of walltime, that should not be a problem.
05/05 09:51:40 INFO: initial resources: PROCS: 8 MEM: 7986M
SWAP: 14G DISK: 1M
05/05 09:51:40 INFO: requested resources: PROCS: 1 (x8)
05/05 09:51:40 INFO: non-user reservation[2] '88801'x2 resources
PROCS: 1
05/05 09:51:40 INFO: checking reservation 88801 start at time
-00:06:18 (O: 822)
05/05 09:51:40 MResCheckJAccess(88801,99468,822,Same,Affinity)
05/05 09:51:40 INFO: exclusive (two job reservations)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: removed resources: PROCS: 1 (x2)
05/05 09:51:40 INFO: resulting resources: PROCS: [ALL] MEM:
7986M SWAP: 14G DISK: 1M
05/05 09:51:40 INFO: non-user reservation[3] '88802'x2 resources
PROCS: 1
05/05 09:51:40 INFO: checking reservation 88802 start at time
-00:06:18 (O: 822)
05/05 09:51:40 MResCheckJAccess(88802,99468,822,Same,Affinity)
05/05 09:51:40 INFO: exclusive (two job reservations)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: removed resources: PROCS: 1 (x2)
05/05 09:51:40 INFO: resulting resources: PROCS: [ALL] MEM:
7986M SWAP: 14G DISK: 1M
05/05 09:51:40 INFO: verifying resource access at -00:06:18 (1)
05/05 09:51:40 INFO: non-user reservation[4] '88807'x2 resources
PROCS: 1
05/05 09:51:40 INFO: checking reservation 88807 start at time
-00:05:53 (O: 847)
05/05 09:51:40 MResCheckJAccess(88807,99468,847,Same,Affinity)
05/05 09:51:40 INFO: exclusive (two job reservations)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: removed resources: PROCS: 1 (x2)
05/05 09:51:40 INFO: resulting resources: PROCS: [ALL] MEM:
7986M SWAP: 14G DISK: 1M
05/05 09:51:40 INFO: verifying resource access at -00:05:53 (2)
05/05 09:51:40 INFO: performing starttime check (3)
05/05 09:51:40 INFO: resources unavailable at time 00:00:00
There are various reservations extant, so we cannot start right now.
Thus, look at each reservation's end in turn:
05/05 09:51:40 INFO: non-user reservation[2] '88801'x2 resources
PROCS: 1
05/05 09:51:40 INFO: checking reservation 88801 end at time
00:13:42 (O: 822)
05/05 09:51:40 MResCheckJAccess(88801,99468,822,Same,Affinity)
05/05 09:51:40 INFO: exclusive (two job reservations)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: added resources: PROCS: 1 (x2)
05/05 09:51:40 INFO: resulting resources: PROCS: [ALL] MEM:
7986M SWAP: 14G DISK: 1M
05/05 09:51:40 INFO: non-user reservation[3] '88802'x2 resources
PROCS: 1
05/05 09:51:40 INFO: checking reservation 88802 end at time
00:13:42 (O: 822)
05/05 09:51:40 MResCheckJAccess(88802,99468,822,Same,Affinity)
05/05 09:51:40 INFO: exclusive (two job reservations)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: added resources: PROCS: 1 (x2)
05/05 09:51:40 INFO: resulting resources: PROCS: [ALL] MEM:
7986M SWAP: 14G DISK: 1M
05/05 09:51:40 INFO: verifying resource access at 00:13:42 (4)
05/05 09:51:40 INFO: non-user reservation[4] '88807'x2 resources
PROCS: 1
05/05 09:51:40 INFO: checking reservation 88807 end at time
00:14:07 (O: 847)
05/05 09:51:40 MResCheckJAccess(88807,99468,847,Same,Affinity)
05/05 09:51:40 INFO: exclusive (two job reservations)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: added resources: PROCS: 1 (x2)
05/05 09:51:40 INFO: resulting resources: MEM: 7986M SWAP: 14G
DISK: 1M
05/05 09:51:40 INFO: verifying resource access at 00:14:07 (5)
05/05 09:51:40 INFO: resources available at time 00:14:07 during
88807 end
As reservations end, more CPUs become available. At 09:51:40 +
00:14:07, the node should be free.
05/05 09:51:40 MNodeGetTC(node023,0,8,8,1,0)
05/05 09:51:40 INFO: TC from PROC: 0
05/05 09:51:40 INFO: ARange[0] adjusted (TC: 8 S: 1241510745)
1241510745 = 10:05:45 CEST
05/05 09:51:40 INFO: user reservation[0] 'SYSTEM.0'x1 resources
PROCS: 8
05/05 09:51:40 INFO: checking reservation SYSTEM.0 start at time
23:05:08:22 (O: 28800)
05/05 09:51:40 MResCheckJAccess(SYSTEM.0,99468,28800,Same,Affinity)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: removed resources: PROCS: 8 (x1)
05/05 09:51:40 INFO: resulting resources: PROCS: [ALL] MEM:
7986M SWAP: 14G DISK: 1M
05/05 09:51:40 INFO: verifying resource access at 23:05:08:22 (6)
05/05 09:51:40 INFO: resources unavailable at time 23:05:08:22
during reservation SYSTEM.0 start (resources)
05/05 09:51:40 INFO: closing ARange[0] (00:14:07 -> 23:05:08:22 :
8) (1)
After 23 days, the SYSGTEM.0 reservation kicks in, making the node
unaccessible again, so we have from now + 0:14:07 to now + 23 days for
our job.
05/05 09:51:40 INFO: user reservation[0] 'SYSTEM.0'x1 resources
PROCS: 8
05/05 09:51:40 INFO: checking reservation SYSTEM.0 end at time
23:13:08:22 (O: 28800)
05/05 09:51:40 MResCheckJAccess(SYSTEM.0,99468,28800,Same,Affinity)
05/05 09:51:40 INFO: reservations are exclusive
05/05 09:51:40 INFO: added resources: PROCS: 8 (x1)
05/05 09:51:40 INFO: resulting resources: MEM: 7986M SWAP: 14G
DISK: 1M
05/05 09:51:40 INFO: verifying resource access at 23:13:08:22 (7)
05/05 09:51:40 INFO: resources available at time 23:13:08:22
during SYSTEM.0 end
After SYSTEM.0 ends, the nodes is available again.
05/05 09:51:40 MNodeGetTC(node023,0,8,8,1,0)
05/05 09:51:40 INFO: TC from PROC: 0
05/05 09:51:40 INFO: ARange[1] adjusted (TC: 8 S: 1243544400)
05/05 09:51:40 INFO: closing ARange[1] (23:13:08:22 ->
INFINITY : 8) (4)
... forever. 1243544400 = May 28 23:00 CEST.
05/05 09:51:40 INFO: node node023 supports 8 tasks of job 99468:0
for 23:04:08:22 at 1:00:00
05/05 09:51:40 INFO: node node023 supports 8 tasks of job 99468:0
for INFINITY at 23:13:08:22
This, I do not understand. Why 1:00:00 rather than 0:14:07?
There are hundreds of low-priority jobs in the queue; in this case,
they need 2 CPUs and 20 minutes, so they will jump into the gap
between 0:14:07 and 1:00:00, preventing job 99468 from starting
earlier. Since the scenario presented above may repeat quite a few
times, high-priority jobs sometimes take days to start even though
free nodes are, in principle, available.
I'd be grateful if anyone can shed some light on why this is happening.
A.
--
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
More information about the mauiusers
mailing list