[torquedev] Preemption race with Torque and Maui
ERoman at lbl.gov
Mon Dec 6 11:59:46 MST 2010
I've been experimenting with Torque and Maui, using BLCR to do
checkpoint-based preemption. When Maui decides to preempt a low-priority
job, I see three events on the server, in this order:
* a hold request issued on the preemptee job, followed by
* a release request on the preemptee, and finally
* a run request for the preemptor.
The final run request fails because the initial hold has not yet completed.
Maui checks that the hold request was issued, but never waits for the job to
exit and enter a held state. In other words, Maui appears to assume that the
hold completes immediately.
I see three ways to fix this.
1. The easy way is to poll the job state (or wait, if that's possible) in
Maui's preemption code, blocking Maui from issuing the run request until the
state changes. Some sort of timeout will be necessary so we don't block the
scheduler forever. This may not be desirable, since it looks like the scheduler
won't be able to service any requests while the checkpoint takes place.
The easiest way to code this is to put a call to pbs_statjob immediately after
the call to pbs_holdjob, and spin/wait until the job state changes to H.
2. The alternative is to issue the hold (and maybe the release), and terminate
the scheduler loop. When the job enters the held state, we can (assuming
the Torque server notifies the scheduler) wake up the scheduler again, and then
schedule the preemptor to run.
I'm not clear on how to implement this. A simple-minded approach is to request
the hold, but change Maui's preemption routine to return a (fake) failure for
the preemption request. Later, when the hold completes, the server should wake
up the scheduler again with the resources freed.
3. Tell the Torque server (and mom) to allow two jobs to run on the node at
the same time. Then the run request will succeed. I'm not sure if this is
possible. There's some old code for parallel jobs on shared nodes, and other
code for time-shared nodes, but I don't know how well either works, or whether
we'd want to spawn new jobs while checkpoints are taking place.
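To make option 1 concrete, here is a rough sketch of the poll-with-timeout
loop. It is not Maui code: wait_for_hold, job_state_fn and fake_state are
hypothetical names invented for illustration, and the state query is hidden
behind a callback so the sketch compiles without the Torque headers. In Maui
the callback would presumably wrap pbs_statjob and read the job state.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns the job's state letter ('R' running,
 * 'H' held, ...) or 0 on error. A real version would wrap pbs_statjob(). */
typedef char (*job_state_fn)(const char *job_id, void *ctx);

/* Hypothetical helper: poll until the job reports 'H' or the timeout
 * expires. Returns 0 when held, 1 on timeout, -1 on a query error. */
static int wait_for_hold(job_state_fn get_state, void *ctx,
                         const char *job_id,
                         int timeout_sec, int interval_ms)
{
    assert(get_state != NULL);

    time_t deadline = time(NULL) + timeout_sec;
    struct timespec pause;
    pause.tv_sec  = interval_ms / 1000;
    pause.tv_nsec = (long)(interval_ms % 1000) * 1000000L;

    for (;;) {
        char state = get_state(job_id, ctx);
        if (state == 0)
            return -1;              /* could not query the job */
        if (state == 'H')
            return 0;               /* hold completed, safe to run preemptor */
        if (time(NULL) >= deadline)
            return 1;               /* give up so the scheduler is not stuck */
        nanosleep(&pause, NULL);    /* sleep between polls instead of spinning */
    }
}

/* Stand-in for a real query, for trying the loop out: reports 'R'
 * until the counter passed in ctx reaches zero, then reports 'H'. */
static char fake_state(const char *job_id, void *ctx)
{
    int *remaining = ctx;
    (void)job_id;
    if (*remaining > 0) {
        (*remaining)--;
        return 'R';
    }
    return 'H';
}
```

The preemption code would call something like wait_for_hold right after
pbs_holdjob and only issue the run request when it returns 0. The same
callback could also be checked once per scheduling pass, without the loop,
which is closer to option 2.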
What do you think?
The relevant part of the server log is below:
11/17/2010 11:01:34;0008;PBS_Server;Job;431.pcp-s-2;Holds s set at request of maui at pcp-s-2
11/17/2010 11:01:34;0008;PBS_Server;Job;431.pcp-s-2;Holds s released at request of maui at pcp-s-2
11/17/2010 11:01:34;0008;PBS_Server;Job;432.pcp-s-2;could not locate requested resources 'pcp-x-2+pcp-x-4+pcp-x-3+pcp-x-1' (node_spec failed) job allocation request exceeds currently available cluster nodes, 4 requested, 0 available
And the Maui log, from a later run (jobs 446 and 447), looks like this:
12/03 13:47:51 MQueueScheduleIJobs(Q,DEFAULT)
12/03 13:47:51 INFO: 4 feasible tasks found for job 447:0 in partition DEFAULT (4 Needed)
12/03 13:47:51 INFO: inadequate feasible tasks found for job 447:0 (0 < 4)
12/03 13:47:51 MJobSelectPJobList(447,4,0,FJobList,PJList,PTCList,PNCList,PTL)
12/03 13:47:51 MRMJobCheckpoint(446,1,SC)
12/03 13:47:51 MPBSJobCkpt(446,R,SC)
12/03 13:47:51 INFO: attribute 'PREEMPTEE' set for job 446
12/03 13:47:51 INFO: tasks located for job 447: 4 of 4 required (4 feasible)
12/03 13:47:51 MJobStart(447)
12/03 13:47:51 MJobDistributeTasks(447,XEN2002,NodeList,TaskMap)
12/03 13:47:51 MAMAllocJReserve(447,RIndex,ErrMsg)
12/03 13:47:51 MRMJobStart(447,Msg,SC)
12/03 13:47:51 MPBSJobStart(447,XEN2002,Msg,SC)
12/03 13:47:51 ERROR: job '447' cannot be started: (rc: 15046 errmsg: 'Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 4 requested, 0 available' hostlist: 'pcp-x-2+pcp-x-4+pcp-x-3+pcp-x-1')
12/03 13:47:51 ALERT: cannot start job 447 (RM 'XEN2002' failed in function 'jobstart')
12/03 13:47:51 WARNING: cannot start job '447' through resource manager
12/03 13:47:51 ALERT: job '447' deferred after 1 failed start attempts (API failure on last attempt)
12/03 13:47:51 MJobSetHold(447,16,1:00:00,RMFailure,cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 4 requested, 0 available')
12/03 13:47:51 ALERT: job '447' cannot run (deferring job for 3600 seconds)
12/03 13:47:51 MSysRegEvent(JOBDEFER: defer hold placed on job '447'. reason: 'RMFailure',0,0,1)