[Mauiusers] Possible Maui bug preempting non-rerunnable jobs

Kevin Hildebrand kevin at umd.edu
Tue Apr 22 07:33:55 MDT 2008


Hello, I've discovered what appears to be a bug when Maui encounters jobs 
marked as preemptible, but the user has specified via qsub that the job 
is non-rerunnable.

I'm not sure what SHOULD happen in this case, but what IS happening is 
definitely undesirable.

Currently, Maui is attempting to tell Torque to rerun the job, and Torque 
is refusing, because the job is marked non-rerunnable.  Maui sees each 
refusal as a resource manager failure and bumps the RM FailCount.  With 
several active non-rerunnable jobs, this pushes the FailCount above 
MAX_RMFAILCOUNT, which stops Maui from processing the rest of the jobs 
in the queue.  The end result is that jobs back up in the queue, all of 
them showing "job can run in partition DEFAULT", but unable to run.

I've temporarily worked around the problem by commenting out the line that 
increments R->FailCount in MPBSJobRequeue (MPBSI.c), but that's probably 
not the best solution.

Some thoughts:
1) Why is Maui trying to tell Torque to rerun a job it should already know 
is non-rerunnable?

2) What is the general feeling for how non-rerunnable jobs should be 
handled in a preemptible queue?  Personally, I'd think either they 
shouldn't be allowed in the queue, or they should be killed if they need 
to be preempted.
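
The two options in (2) can be sketched as a small policy check.  This is 
only an illustration under assumed names (policy_job_t, policy_admit, 
policy_preempt_action); none of these exist in Maui or Torque, and the 
sketch says nothing about how either project actually handles this case.

```c
#include <stdbool.h>

/* Hypothetical actions a scheduler could take when preempting a job. */
typedef enum {
    POLICY_ACTION_REQUEUE,  /* ask the resource manager to rerun the job */
    POLICY_ACTION_CANCEL    /* kill the job outright */
} policy_action_t;

/* Hypothetical job record; only the rerunnable flag matters here. */
typedef struct {
    bool rerunnable;
} policy_job_t;

/* Option A: refuse non-rerunnable jobs at submission time when the
 * target queue is preemptible and the site has chosen to reject them. */
bool policy_admit(const policy_job_t *job,
                  bool queue_preemptible,
                  bool reject_nonrerunnable)
{
    if (queue_preemptible && reject_nonrerunnable && !job->rerunnable) {
        return false;
    }
    return true;
}

/* Option B: admit the job, but when it must be preempted, cancel it
 * instead of issuing a requeue request that would be refused. */
policy_action_t policy_preempt_action(const policy_job_t *job)
{
    return job->rerunnable ? POLICY_ACTION_REQUEUE : POLICY_ACTION_CANCEL;
}
```

Either way, the scheduler never sends a rerun request for a job it 
already knows is non-rerunnable, so no refusal is counted against the 
resource manager.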

Thanks,

Kevin Hildebrand
University of Maryland, College Park

