[torqueusers] health check
mej at lbl.gov
Mon Aug 27 14:54:13 MDT 2012
On Monday, 27 August 2012, at 12:50:29 (-0700),
Arka Aloke Bhattacharya wrote:
> I was configuring torque on a 100-server cluster.
> I was wondering how common a practice it is to configure a PBS_MOM to
> use a health-check script?
It's quite common.
> How does one ensure that the health-check script covers all eventualities?
> Can you give me advice regarding the most common types of failures
> that the health-check script usually detects?
At the risk of self-promotion, I recommend you check out our NHC
(Node Health Check) project.
It's specifically engineered to be cross-site, portable, and flexible,
and to support any check you can write in a shell script. And it has
specific safeguards in place to protect against the type of lockup
issue Lloyd cited in his prior e-mail.
Give it a look-see; feedback is most welcome!
And for those who are already using it, I know I've been quiet, but
the new release will be out very soon with some great new features!
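For illustration only, here is a minimal sketch of the kind of shell-based
health check discussed above, with a per-check time limit as one simple way
to guard against a hung check. This is not NHC itself and does not reflect
how NHC is implemented. The function names, the CHECK_TIMEOUT variable, the
two example checks, and the script path in the comments are all hypothetical.
The sketch assumes the coreutils `timeout` utility is installed, and it
assumes pbs_mom treats script output beginning with ERROR as a failure
report, which is how the TORQUE documentation describes it.

```shell
#!/bin/sh
# Hypothetical health-check sketch; not NHC and not an official TORQUE script.
# Assumed pbs_mom config (mom_priv/config), path is illustrative only:
#   $node_check_script   /usr/local/sbin/health_check.sh
#   $node_check_interval 5

# Run one check under a time limit so a hung command cannot stall the script.
# Usage: run_check LABEL COMMAND [ARGS...]
# COMMAND must be an external program, because `timeout` cannot run
# shell functions or builtins.
run_check() {
    label=$1
    shift
    # CHECK_TIMEOUT is a hypothetical knob; default is 10 seconds.
    if ! timeout "${CHECK_TIMEOUT:-10}" "$@" >/dev/null 2>&1; then
        # Assumption: pbs_mom looks for output that begins with "ERROR".
        echo "ERROR $label check failed"
        return 1
    fi
    return 0
}

# Example checks only; a real site would add its own.
health_check() {
    run_check "tmp-writable" test -w /tmp
    run_check "root-listing" ls /
    # Failures are reported on stdout, so always return success here.
    return 0
}

health_check
```

Each additional check is one more run_check line, so the time limit applies
uniformly without repeating the timeout logic in every check.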
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615