[Mauiusers] Preemption circus

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Mon Feb 7 06:18:45 MST 2005


Hi,

We are running version 3.2.6p11 of Maui and torque_1.1.0p4.

Torque has two queues, riskjobb for fully preemptible, opportunistic,
low-priority jobs, and workq for normal jobs.

Important parts of the Maui configuration:

BACKFILLPOLICY          BESTFIT
RESERVATIONPOLICY       CURRENTHIGHEST
RESERVATIONDEPTH        300
JOBPRIOACCRUALPOLICY    FULLPOLICY
NODEALLOCATIONPOLICY    MINRESOURCE
JOBNODEMATCHPOLICY      EXACTNODE
NODEACCESSPOLICY        SINGLEJOB

SRCFG[kernelupdate] HOSTLIST=ALL
SRCFG[kernelupdate] STARTTIME=8:00:00 ENDTIME=15:00:00
SRCFG[kernelupdate] PERIOD=DAY DAYS=Thu DEPTH=8
SRCFG[kernelupdate] USERLIST=joe

PREEMPTIONPOLICY        REQUEUE
CLASSCFG[riskjobb]      QDEF=Risk
CLASSCFG[workq]         QDEF=Normal

QOSCFG[Risk]            PRIORITY=1 XFWEIGHT=1 QTWEIGHT=1 QFLAGS=PREEMPTEE,IGNALL
QOSCFG[Normal]          PRIORITY=100000 XFWEIGHT=1000 QFLAGS=PREEMPTOR

USERCFG[user1]          MAXIJOB=10 MAXPROC=8
USERCFG[user2]          MAXIJOB=10 MAXPROC=8


Before we introduced the Standing Reservation on Thursday, this worked nicely,
with riskjobbs filling up most of the free nodes not used by the normal (workq)
jobs, except for one thing:

Problem 1: After Maui had sent the requeue request to Torque to preempt a
riskjobb, it immediately sent a request to Torque to start the normal job
and got no answer. I guess that this is a timing problem between Maui and
Torque, because it helped to put a "sleep (10);" after the requeuing in the
Maui code, although this of course made Maui somewhat unresponsive to
e.g. showq commands.
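
A bounded poll could in principle replace the fixed sleep, returning as soon
as the resource manager reports that the job has left the running state. The
following is only a sketch of that idea, written in Python rather than Maui's
C. The callable get_job_state, the state strings, and the timeout values are
hypothetical stand-ins, not Maui or Torque interfaces.

```python
import time


def wait_for_requeue(get_job_state, job_id, timeout=10.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until job_id is no longer running, or until timeout expires.

    get_job_state is a hypothetical callable that returns a state string
    such as "Running" or "Queued"; it stands in for a query to the
    resource manager. Returns True if the job left the running state in
    time, otherwise False.
    """
    deadline = clock() + timeout
    while True:
        if get_job_state(job_id) != "Running":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With this shape the scheduler blocks only for as long as the requeue really
takes, with the old 10 seconds as an upper bound rather than a fixed cost.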

When we introduced the Standing Reservation, we found another problem:

Problem 2: If we queue a normal (workq) job which is too long to run to
completion before the system reservation on Thursday, Maui of course tries
to schedule it.

Maui goes through all nodes trying to arrange for an immediate start of the
normal one-node job (nr 10716), according to the following Maui log
(log level 9):

02/06 00:41:57 INFO:     checking job '10716'
02/06 00:41:57 INFO:     checking job 10716(1)  state: Idle (ex: Idle)
02/06 00:41:57 INFO:     node n3 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n4 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n5 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n6 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n8 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n10 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n11 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n12 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n13 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n14 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n15 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n16 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n18 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n19 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n20 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n21 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n22 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n23 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n24 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n25 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n27 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n28 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n29 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n30 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     node n31 unavailable for job 10716 at 00:00:00
02/06 00:41:57 INFO:     inadequate nodes found for job 10716:0 (0 < 1)
02/06 00:41:57 INFO:     new preemptible job 10718 located on node n33 (3 < 101044)
02/06 00:41:57 INFO:     node n33 unavailable for job 10716 at 00:00:01
02/06 00:41:57 INFO:     adequate tasks (P+I=25+0) located for job 10716
02/06 00:41:57 INFO:     preemptible job 10718 provides 1/1 tasks/nodes
02/06 00:41:57 INFO:     job '10718' requeued
02/06 00:42:07 INFO:     hostlist for job '10718' set to '1'
02/06 00:42:07 INFO:     job '10718'  hash 3491
02/06 00:42:07 INFO:     job '10718' found at hash[3491] 72 '10718' (J->Name: 10718)
02/06 00:42:07 INFO:     job '10718' reservation released (tasks requested: 1)
02/06 00:42:07 INFO:     job flags for job 10718: 1800
02/06 00:42:07 INFO:     attribute 'PREEMPTEE' set for job 10718
02/06 00:42:07 INFO:     1(0) tasks/1(0) nodes found for job 10716 in MJobSelectMNL
02/06 00:42:07 INFO:     resources found for job 10716 tasks: 1+0 of 1  nodes: 1+0 of 0
02/06 00:42:07 ERROR:    cannot allocate nodes to job '10716' in partition DEFAULT
02/06 00:42:07 INFO:     system min start time set on job 10716 for 00:00:01
02/06 00:42:07 INFO:     adequate policy slot located at time 00:00:01 for job 10716

Maui makes the mistake of preempting the running riskjobb (nr 10718), even
though it cannot use its node (n33) due to the system reservation.

This has the unfortunate effect that a number of nodes (as many as the total
width of the queued normal jobs) cannot be used by riskjobbs and, worse, that
a number of riskjobbs are requeued and rerun in each scheduling cycle.

Our proposal is that Maui should check whether it can use any part of the
reservation of a preemptible job before it preempts that job.
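
The proposed check can be sketched as follows. This is an illustration in
Python, not Maui code: the Reservation type, the function names, and the
times (in seconds relative to now) are assumptions made up for the example.
It treats a node as usable only if the preempting job, started now, would
finish before every reservation on that node begins.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    """A hypothetical reservation window on one node, in seconds from now."""
    start: int
    end: int


def node_usable(walltime, reservations, now=0):
    """True if a job of the given walltime, started now, overlaps no reservation."""
    job_end = now + walltime
    return all(job_end <= r.start or now >= r.end for r in reservations)


def should_preempt(preemptor_walltime, preemptee_node_reservations):
    """Preempt only if the freed node could actually run the preempting job."""
    return node_usable(preemptor_walltime, preemptee_node_reservations)
```

For example, with a reservation starting 100 hours from now, a 200-hour
normal job would not trigger preemption under this rule, while a 10-hour job
would.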

Other comments on our configuration are also welcome. We had serious
problems getting the rules for one CLASS to be independent of the rules for
the other CLASS, even though we wanted preemption to be the only interaction
between them. (E.g. the MAXPROC=8 rule for user1 counts the sum of processors
used by her/his running riskjobb AND workq jobs, but we would have much
preferred the rule not to count the already running riskjobb jobs of user1.)
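
The difference between the two counting rules can be illustrated with a small
Python sketch. The job records, the helper counted_procs, and the workload
numbers are made up for the example; this is not how Maui stores or evaluates
its limits.

```python
def counted_procs(running_jobs, user, exempt_classes=frozenset()):
    """Sum the processors of a user's running jobs, skipping exempt classes."""
    return sum(
        job["procs"]
        for job in running_jobs
        if job["user"] == user and job["class"] not in exempt_classes
    )


# Made-up workload: user1 runs 6 processors of riskjobb and 4 of workq.
running = [
    {"user": "user1", "class": "riskjobb", "procs": 6},
    {"user": "user1", "class": "workq", "procs": 4},
]
MAXPROC = 8

# Behaviour we observe: both classes count towards the limit.
over_limit_now = counted_procs(running, "user1") > MAXPROC

# Behaviour we would prefer: running riskjobb jobs are not counted.
over_limit_preferred = counted_procs(running, "user1", {"riskjobb"}) > MAXPROC
```

In this example user1 is over the limit under the observed rule but well
within it under the preferred one.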

Best regards,
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se
   +46 706 49 55 35
   +46 13 28 26 24



