[torqueusers] health check
lloyd_brown at byu.edu
Mon Aug 27 13:58:16 MDT 2012
We use it. Generally we check for things like:
- host's firmware up to date (based on a list of "current" that we create)
- host's local hard drive not full
- host's local filesystems are writable
- host can connect to NFS central storage
- host's InfiniBand interface is up
- host doesn't have a sudden drop in total physical memory since last
time it booted (indicates DIMM failure)
- host's CPUs are not being throttled (pulled from "rdmsr" somehow)
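
A few of the checks above can be sketched in Python roughly as follows. This is only an illustration, not our actual script: the function names, the 95% disk threshold, and the 2% memory tolerance are all assumptions, and the memory baseline is assumed to have been saved somewhere at a previous boot.

```python
import shutil
import tempfile


def disk_not_full(path, max_used_fraction=0.95):
    """Return True if the filesystem holding `path` is below the usage limit.

    The 0.95 default is an arbitrary illustrative threshold.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total <= max_used_fraction


def filesystem_writable(path):
    """Return True if a temporary file can be created in `path`."""
    try:
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


def read_mem_total_kb(meminfo_path="/proc/meminfo"):
    """Read total physical memory in kB from a Linux /proc/meminfo file."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found in " + meminfo_path)


def memory_not_dropped(current_kb, baseline_kb, tolerance=0.02):
    """Return True if memory has not fallen noticeably below the baseline.

    `baseline_kb` is assumed to be a value recorded at an earlier boot;
    the 2% tolerance is an assumption to absorb small kernel differences.
    """
    return current_kb >= baseline_kb * (1.0 - tolerance)
```

The firmware, InfiniBand, and CPU-throttling checks depend on site-specific tools, so they are left out of this sketch.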
One thing we have run into is that when a check blocks, or otherwise
takes too long, inside the health check script, it can cause scheduling
problems because pbs_mom stops being responsive while it waits.
The simple work-around is to have the individual checks put their
output/status in some sort of file or local database, and then have the
health check just grab the most recent statuses from those
files/databases. Of course you'll also want to deal with the situation
where the most recent output is "too old" (whatever that means in your
environment).
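
A minimal sketch of that work-around, assuming one JSON status file per check in a local directory. The file layout, the function names, and the 600-second staleness limit are all hypothetical choices for illustration. Each check runs on its own schedule (cron, for example) and calls write_status; the health check itself only calls collect_errors, which never blocks on a slow check.

```python
import json
import os
import time


def write_status(status_dir, name, ok, message="", now=None):
    """Record the latest result of one check (called by the check itself)."""
    record = {
        "ok": ok,
        "message": message,
        "time": time.time() if now is None else now,
    }
    tmp_path = os.path.join(status_dir, name + ".json.tmp")
    final_path = os.path.join(status_dir, name + ".json")
    with open(tmp_path, "w") as f:
        json.dump(record, f)
    # Rename so a reader never sees a half-written status file.
    os.replace(tmp_path, final_path)


def collect_errors(status_dir, max_age_seconds=600, now=None):
    """Return error strings for failed, stale, or unreadable statuses."""
    now = time.time() if now is None else now
    errors = []
    for entry in sorted(os.listdir(status_dir)):
        if not entry.endswith(".json"):
            continue  # skips in-progress ".json.tmp" files
        name = entry[: -len(".json")]
        try:
            with open(os.path.join(status_dir, entry)) as f:
                record = json.load(f)
            stamp = float(record["time"])
            ok = bool(record["ok"])
        except (OSError, ValueError, KeyError, TypeError) as exc:
            errors.append(f"{name}: unreadable status ({exc})")
            continue
        age = now - stamp
        if age > max_age_seconds:
            # A check that stopped reporting is treated as a failure.
            errors.append(f"{name}: status is stale ({int(age)}s old)")
        elif not ok:
            errors.append(f"{name}: {record.get('message', '')}")
    return errors
```

The health check script would then print the collected errors in whatever form your pbs_mom expects. As far as I know that means a line beginning with ERROR, but check the documentation for your Torque version.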
Fulton Supercomputing Lab
Brigham Young University
On 08/27/2012 01:50 PM, Arka Aloke Bhattacharya wrote:
> I was configuring torque on a 100-server cluster.
> I was wondering how common a practice it is to configure a PBS_MOM to
> use a health-check script?
> How does one ensure that the health-check script covers all eventualities?
> Can you give me advice on the most common types of
> failures that a health-check script usually detects?
> Thanks a lot,
> torqueusers mailing list
> torqueusers at supercluster.org