[torqueusers] health check

Michael Jennings mej at lbl.gov
Mon Aug 27 14:54:13 MDT 2012


On Monday, 27 August 2012, at 12:50:29 (-0700),
Arka Aloke Bhattacharya wrote:

> I was configuring TORQUE on a 100-server cluster.
> I was wondering how common a practice it is to configure a PBS_MOM to
> use a health-check script?

It's quite common.

> How does one ensure that the health-check script covers all eventualities?

Experience.  ;-)

> Can you give me advice regarding the most common types of failures
> that the health-check script usually detects?

At the risk of self-promotion, I recommend you check out our NHC
project:

http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

It's specifically engineered to be cross-site, portable, and flexible,
and to support any check you can write in a shell script.  It also has
specific safeguards in place to protect against the type of lockup
issue Lloyd cited in his prior e-mail.
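
To give a rough idea of what "a check with a safeguard" can look like,
here is a minimal sketch in Python.  It is not NHC's code (NHC is
written in shell), and the two checks in CHECKS are hypothetical
placeholders.  It assumes the TORQUE convention, as I understand it,
that a health-check script flags a node by printing a message that
begins with ERROR.  Each check runs under a time limit so that one
hung command does not stall the whole script.

```python
import subprocess
import sys

# Hypothetical example checks as (label, command) pairs.
# Replace these with whatever matters at your site.
CHECKS = [
    ("root filesystem readable", ["ls", "/"]),
    ("/tmp writable", ["test", "-w", "/tmp"]),
]


def run_check(label, command, timeout_s=5.0):
    """Run one check; return None on success or a failure description."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return f"{label}: timed out after {timeout_s}s"
    except OSError as exc:
        # Command missing or not executable.
        return f"{label}: could not run ({exc})"
    if result.returncode != 0:
        return f"{label}: exit status {result.returncode}"
    return None


def first_failure(checks, timeout_s=5.0):
    """Return the first failure description, or None if every check passes."""
    for label, command in checks:
        failure = run_check(label, command, timeout_s)
        if failure is not None:
            return failure
    return None


if __name__ == "__main__":
    failure = first_failure(CHECKS)
    if failure is not None:
        # Assumed convention: output starting with ERROR marks the node unhealthy.
        print(f"ERROR {failure}")
        sys.exit(1)
```

One caveat: a timeout like this only helps when the hung child can
actually be killed.  A process stuck in uninterruptible I/O (a dead
NFS mount, say) may not go away, so treat this as an illustration of
the idea rather than a complete defense.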

Give it a look-see; feedback is most welcome!

And for those who are already using it, I know I've been quiet, but
the new release will be out very soon with some great new features!
:-)

HTH,
Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

