TORQUE Resource Manager

TORQUE Administrator's Manual - 10.2 Compute Node Health Check

10.2 Compute Node Health Check

10.2.1 Compute Node Health Check Overview

   TORQUE can perform health checks on each compute node.  If a check fails, a failure message can be associated with the node and routed to the scheduler.  Schedulers (such as Moab) can forward this information to administrators by way of scheduler triggers, make it available through scheduler diagnostic commands, and automatically mark the node down until the issue is resolved.  (See the RMMSGIGNORE parameter.)

10.2.2 Configuring MOMs to Launch a Health Check

   The health check feature is configured via the pbs_mom config file using the parameters described below:

parameter            format     default  description
node_check_script    <STRING>   N/A      (required) specifies the fully qualified pathname of the health check script to run
node_check_interval  <INTEGER>  1        (optional) specifies the number of MOM intervals between subsequent executions of the specified health check (by default, each MOM interval is 45 seconds long; this is controlled via the DEFAULT_SERVER_STAT_UPDATES #define located in $TORQUEDIR/src/resmom/mom_main.c)
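
   Because node_check_interval counts MOM intervals rather than seconds, the time between health check runs is the product of the two values. The sketch below works through that arithmetic, assuming the 45-second default MOM interval quoted above; MOM_INTERVAL and check_period are hypothetical names used only for illustration and are not TORQUE settings.

```shell
#!/bin/sh
# Assumed MOM interval length in seconds (the default quoted in the table).
MOM_INTERVAL=45

# check_period (hypothetical helper): print the number of seconds between
# health check runs for a given node_check_interval value.
check_period() {
    echo $(( $1 * $MOM_INTERVAL ))
}
```

   For example, `check_period 10` prints 450, so a node_check_interval of 10 would run the health check roughly every seven and a half minutes under the default MOM interval.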

10.2.3 Creating the Health Check Script

   The health check script is executed directly by the pbs_mom daemon under the root user id. It must be accessible from the compute node and may be a script or a compiled executable program.  It may make any needed system calls and execute any combination of system utilities, but it should not execute resource manager client commands.  Also, as of TORQUE 1.0.1, the pbs_mom daemon blocks until the health check completes and has no built-in timeout.  Consequently, it is advisable to keep the health check script's execution time short and to verify that the script will not block, even under failure conditions.
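
   Since pbs_mom has no built-in timeout, one way to keep a check from blocking is to bound each command inside the script itself. The following is a minimal sketch, assuming the GNU coreutils `timeout` utility is installed on the compute node (it exits with status 124 when the time limit is reached); run_check is a hypothetical helper name, not part of TORQUE.

```shell
#!/bin/sh
# run_check (hypothetical helper): run a command with a time limit in
# seconds and print an ERROR line if it times out or fails.
# Assumes GNU coreutils `timeout` is available on the node.
run_check() {
    limit=$1
    shift
    # Discard the command's own output so only ERROR lines reach stdout.
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "ERROR check timed out after ${limit}s: $*"
    elif [ "$status" -ne 0 ]; then
        echo "ERROR check failed (exit $status): $*"
    fi
}
```

   A health check script could then call, for instance, `run_check 5 ls /global`, so that a hung filesystem produces an ERROR message after five seconds instead of stalling the daemon.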

   If the script detects a failure, it should write the keyword 'ERROR' to stdout, followed by an error message.  The message (up to 256 characters) immediately following the ERROR string is assigned to the 'message' attribute of the associated node.
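
   The output convention can be wrapped in a small helper so that every check reports failures the same way. This is only a sketch: report_error is a hypothetical name, and the truncation is a defensive measure that assumes the 256-character limit applies to the text after the keyword.

```shell
#!/bin/sh
# report_error (hypothetical helper): print a single-line message in the
# "ERROR <message>" form described above.
report_error() {
    # Fold any newlines into spaces so the message stays on one line.
    msg=$(printf '%s' "$*" | tr '\n' ' ')
    # Keep at most 256 characters of message text after the keyword.
    printf 'ERROR %.256s\n' "$msg"
}
```

   For example, `report_error "cannot locate filesystem global"` prints the same line as the example script in section 10.2.5.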

10.2.4 Adjusting Node State Based on the Health Check Output

   If the health check reports an error, the node attribute 'message' is set to the error string returned.  Cluster schedulers can be configured to adjust a given node's state based on this information.  For example, by default, Moab sets a node's state to down if a node error message is detected and restores the state as soon as the failure disappears.

10.2.5 Example Health Check Script

   As mentioned, the health check can be a shell script, a Perl or Python script, a compiled C program, or anything else that can be executed from the command line and can write to stdout.  The example below demonstrates a very simple health check:

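#!/bin/sh

# Report an error if no mounted filesystem matches "global".
# grep -q suppresses the matching lines so that only the ERROR
# message, if any, reaches stdout.
if ! /bin/mount | grep -q global
  then
    echo "ERROR cannot locate filesystem global"
  fi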
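
   The same idea extends to several filesystems. The sketch below is one possible variation, not part of TORQUE: check_mounts is a hypothetical function that reads a mount table on stdin and reports the first required name that is missing, and the filesystem name "scratch" in the usage note is only an illustration.

```shell
#!/bin/sh
# check_mounts (hypothetical helper): read a mount table on stdin and
# report the first required name (given as arguments) that is absent.
check_mounts() {
    table=$(cat)
    for fs in "$@"; do
        # -F treats the name as a fixed string, -q keeps stdout clean.
        if ! printf '%s\n' "$table" | grep -qF -- "$fs"; then
            echo "ERROR cannot locate filesystem $fs"
            return 0
        fi
    done
}
```

   A health check script could invoke it as `/bin/mount | check_mounts global scratch`. Only the first missing filesystem is reported, which keeps the output to a single ERROR line.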