11.2 Compute Node Health Check
11.2.1 Compute Node Health Check Overview
TORQUE provides the ability to perform health checks on each compute node. If these checks fail, a failure message can be associated with the node and routed to the scheduler. Schedulers (such as Moab) can forward this information to administrators by way of scheduler triggers, make it available through scheduler diagnostic commands, and automatically mark the node down until the issue is resolved. (See the RMMSGIGNORE parameter in Appendix F of the Moab Workload Manager Administrator's Guide for more information.)
11.2.2 Configuring MOMs to Launch a Health Check
The health check feature is configured via the pbs_mom config file using the parameters described below:
11.2.3 Creating the Health Check Script
The pbs_mom daemon executes the health check script directly under the root user ID. The script must be accessible from the compute node and may be a script or a compiled executable program. It may make any needed system calls and execute any combination of system utilities, but it should not execute resource manager client commands. Also, as of TORQUE 1.0.1, the pbs_mom daemon blocks until the health check completes and has no built-in timeout. Consequently, it is advisable to keep the script's execution time short and to verify that the script will not block even under failure conditions.
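Because pbs_mom blocks while the check runs, one way to keep a hung probe from stalling the daemon is to put a time limit on each command the script runs. The following is a minimal sketch, assuming the GNU coreutils timeout utility is installed on the compute node. The run_bounded helper, the CHECK_TIMEOUT value, and the /tmp probe are illustrative and are not part of TORQUE.

```shell
#!/bin/sh
# Sketch: bound each probe so a hung command cannot block pbs_mom indefinitely.
# Assumes GNU coreutils `timeout` is available on the compute node.

CHECK_TIMEOUT=5      # seconds per probe; illustrative value
PROBE_DIR="/tmp"     # directory to probe; illustrative value

# run_bounded is a hypothetical helper, not a TORQUE facility.
# It runs the given command with its output discarded and returns
# non-zero if the command fails or exceeds CHECK_TIMEOUT seconds.
run_bounded() {
    timeout "$CHECK_TIMEOUT" "$@" >/dev/null 2>&1
}

# Listing a directory on a hung network filesystem can block forever,
# so the probe is wrapped in the time limit.
if ! run_bounded ls "$PROBE_DIR"; then
    echo "ERROR cannot list $PROBE_DIR within ${CHECK_TIMEOUT}s"
fi
```

With this pattern the worst-case run time of the script is roughly the number of probes multiplied by CHECK_TIMEOUT, which makes it easier to reason about how long pbs_mom may be blocked.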
If the script detects a failure, it should print the keyword 'ERROR' to stdout, followed by an error message. The ERROR keyword must be printed before any other data, and the message (up to 1024 characters) must follow it on the same line. The message is assigned to the 'message' attribute of the associated node.
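The output rules above (ERROR first, a single line, at most 1024 characters of message) can be collected in a small helper. This is only a sketch: report_error is a hypothetical function name, and the /tmp test is a placeholder for a real probe.

```shell
#!/bin/sh
# Sketch: format a failure report the way the health check output is described.

# report_error is a hypothetical helper, not part of TORQUE.
# It joins its arguments into one message, replaces newlines with spaces
# so the message stays on a single line, trims it to 1024 characters,
# and prints it to stdout after the ERROR keyword.
report_error() {
    msg=$(printf '%s' "$*" | tr '\n' ' ' | cut -c1-1024)
    echo "ERROR $msg"
}

# Placeholder probe: report a failure if /tmp is not a directory.
if [ ! -d /tmp ]; then
    report_error "/tmp is missing"
fi
```

A script using this helper should avoid printing anything else to stdout before calling it, so that the ERROR keyword remains the first data the daemon sees.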
11.2.4 Adjusting Node State Based on the Health Check Output
If the health check reports an error, the node attribute 'message' is set to the error string returned. Cluster schedulers can be configured to adjust a given node's state based on this information. For example, by default, Moab sets a node's state to down if a node error message is detected and restores the state as soon as the failure disappears.
11.2.5 Example Health Check Script
As mentioned, the health check can be a shell script, a Perl or Python script, a compiled C program, or anything else that can be executed from the command line and can write to stdout. The example below demonstrates a very simple health check:
#!/bin/sh
/bin/mount | grep global
if [ $? != "0" ]
then
    echo "ERROR cannot locate filesystem global"
fi
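The simple example can be extended to several probes. The sketch below is one possible arrangement, not a recommended or official script: it stops at the first failing probe so that exactly one ERROR line is printed and nothing precedes it. The free-space probe, the MIN_TMP_KB threshold, and the check_node function name are assumptions added for illustration.

```shell
#!/bin/sh
# Sketch of a health check with more than one probe.

REQUIRED_FS="global"   # name expected in the mount table, as in the simple example
MIN_TMP_KB=102400      # hypothetical minimum free space in /tmp (100 MB)

# check_node is a hypothetical function. It prints one ERROR line for the
# first failing probe and returns non-zero; it prints nothing when every
# probe passes.
check_node() {
    # grep -q suppresses the matching lines, so stdout stays empty on success.
    if ! /bin/mount | grep -q "$REQUIRED_FS"; then
        echo "ERROR cannot locate filesystem $REQUIRED_FS"
        return 1
    fi

    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    free_kb=$(df -Pk /tmp | awk 'NR == 2 { print $4 }')
    if [ "${free_kb:-0}" -lt "$MIN_TMP_KB" ]; then
        echo "ERROR /tmp has only ${free_kb:-0} KB free"
        return 1
    fi
    return 0
}

check_node
```

Running the script by hand on a compute node before configuring it in pbs_mom is a quick way to confirm that it prints nothing on a healthy node and a single ERROR line otherwise.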
© 2001-2010 Adaptive Computing Enterprises, Inc.