[torqueusers] Getting Torque 2.x/Maui 3.x to work on CentOS 6?

Michael Jennings mej at lbl.gov
Tue Nov 20 13:16:59 MST 2012


On Tuesday, 20 November 2012, at 10:32:12 (-0500),
Paul Raines wrote:

> The worst is when a node "goes bad" such as its root disk fails.
> When that happens, jobs still get scheduled on the node but fail in a
> manner where they just get stuck with checkjob giving a reason of
> "Execution server rejected request MSG=cannot send job to mom,
> state=PRERUN". The bad thing is it keeps queuing all new jobs on the
> node which get stuck the same way.  Yesterday when this happened, I
> had over 180 jobs trying to run on the bad node and getting stuck.
> The only solution is to take the node offline and then qdel all the
> jobs and email all my users apologizing and telling them to resubmit
> all their jobs.

I can't do anything about problem #2, but problem #1 seems like it
might be resolved by using a node health check script.  The Warewulf
NHC project can be used in this role and has built-in checks for
making sure the root filesystem is mounted read-write and has not
errored out.  It can also verify that your filesystems haven't filled.
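
To give a rough idea of what such a check involves, here is a minimal
sketch in Python.  It is not NHC's actual code; the function names, the
95% threshold, and the "ERROR" output convention are all assumptions
made for illustration.  It reads a mount table in /proc/mounts format,
confirms that / is mounted read-write, and flags a nearly full root
filesystem.

```python
import os


def root_mount_options(mounts_text, mount_point="/"):
    """Return the mount options of the last entry for mount_point, or None.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Later entries shadow earlier ones (e.g. "rootfs" vs. the real device).
            options = fields[3].split(",")
    return options


def used_fraction(path="/"):
    """Fraction of the filesystem at path unavailable to unprivileged users."""
    st = os.statvfs(path)
    if st.f_blocks == 0:
        return 0.0
    return 1.0 - st.f_bavail / st.f_blocks


def check_root(mounts_text, used, max_used=0.95):
    """Return a list of problem descriptions; an empty list means healthy.

    max_used is a hypothetical threshold chosen for this sketch.
    """
    problems = []
    options = root_mount_options(mounts_text)
    if options is None:
        problems.append("root filesystem not found in mount table")
    elif "rw" not in options:
        problems.append("root filesystem is not mounted read-write")
    if used > max_used:
        problems.append(f"root filesystem is {used:.0%} full")
    return problems


def main():
    """Check the live system; print an ERROR line and return 1 on failure."""
    with open("/proc/mounts") as fh:
        mounts_text = fh.read()
    problems = check_root(mounts_text, used_fraction("/"))
    if problems:
        # Assumed convention: a line starting with ERROR signals a bad node.
        print("ERROR " + "; ".join(problems))
        return 1
    return 0
```

To run it as a standalone script, you would end the file with
"import sys; sys.exit(main())".  How the result gets acted on (for
example, marking the node offline so no new jobs land on it) depends on
how the script is hooked into your resource manager, which this sketch
does not cover.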

If you're interested, I'd be happy to help you with getting it set up
and working in your environment.

More info is available on the web site at:
https://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

HTH!
Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

