[torquedev] suggestion on changing the mom's handling of $node_check_script and $down_on_error

Jerry Smith jdsmit at sandia.gov
Thu Oct 23 09:44:59 MDT 2008

Michael Barnes wrote:
> Torque developers,
> I like the addition of the $node_check_script and $down_on_error for the
> pbs_mom, but I think it's more complicated than it needs to be, and not as
> robust as it could be.
> For example, I have this in place on a few clusters that I manage, but
> I've seen more than once where a compute node has had a failed disk, and
> because the disk had failed, the $node_check_script could not run, which
> was silently ignored, and then tons of jobs ran through that node as fast
> as they could, draining tons of users' jobs.
We try to avoid this by having our $PBS_HOME/$PBS_BIN directories NFS
mounted, so if the script is "unavailable", the pbs_mom is too, and it
could never start in the first place. If a mom is already up and we lose
the local disk, the NFS-mounted health script is set up to check for
evidence of disk failure and take appropriate action.
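
As a rough illustration of that kind of check (a minimal sketch, not our
actual script), the following Python tries to write and read back a small
file on each local path and reports a line beginning with "ERROR" on
failure, following the "ERROR ..." message convention mentioned in this
thread. The function names and the choice of paths are hypothetical.

```python
import os
import tempfile


def check_local_disk(path):
    """Return None if a small file can be written and read back under
    path, otherwise a short description of the failure."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as f:
            f.write(b"ok")
            f.flush()
            os.fsync(f.fileno())  # force the write out to the device
            f.seek(0)
            if f.read() != b"ok":
                return "read-back mismatch on %s" % path
    except OSError as exc:
        return "cannot write to %s: %s" % (path, exc)
    return None


def health_report(paths):
    """Return "" when every path is healthy, else one "ERROR ..." line
    describing the first failure found."""
    for path in paths:
        err = check_local_disk(path)
        if err:
            return "ERROR " + err
    return ""


if __name__ == "__main__":
    # Hypothetical choice: check only the system temporary directory.
    report = health_report([tempfile.gettempdir()])
    if report:
        print(report)
```

Because the script itself lives on NFS, it can still run and report the
failure when the local disk is gone.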

> I think that a different model would be to get rid of
> $node_check_script, and replace it with $node_check_file.  And instead
> of reading the script's output, do the following:
> 1) stat() the file, if $down_on_error is set and accessing the
> file fails, set the node down with the message "Can't access
> $node_check_file".
> 2) If the file is too old (I guess old should be another configuration
> option), emit an error and set the node down if appropriate.
> 3) If we got past 1 and 2, then read the file; if the file says "ERROR
> whatever the error is", then echo that error, and set the node down if
> appropriate.
What are you doing to create the "$node_check_file" and fill it with
the appropriate "ERROR:" message? Are you running a script? If so, then
this just pushes the problem one step further along without adding any
functionality, IMHO. If it can't get to that external script (as in
your example), then how do you populate the $node_check_file?
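
For reference, the three steps proposed above could be sketched roughly as
follows. This is only an illustration in Python of the proposed logic, not
pbs_mom code; the function name and the max_age_seconds parameter (the
"another configuration option" for staleness) are hypothetical.

```python
import os
import time


def evaluate_check_file(path, max_age_seconds, now=None):
    """Return (ok, message) for a health file, following the proposal:
    1) stat the file, 2) reject it if too old, 3) look for "ERROR"."""
    now = time.time() if now is None else now

    # Step 1: the file must be accessible at all.
    try:
        st = os.stat(path)
    except OSError:
        return False, "Can't access %s" % path

    # Step 2: a stale file means whatever writes it has stopped running.
    age = now - st.st_mtime
    if age > max_age_seconds:
        return False, "%s is too old (%d seconds)" % (path, age)

    # Step 3: the first line carries the status message.
    try:
        with open(path, "r", errors="replace") as f:
            first = f.readline().strip()
    except OSError:
        return False, "Can't access %s" % path
    if first.startswith("ERROR"):
        return False, first
    return True, ""
```

Whether a failed result actually marks the node down would still depend on
$down_on_error. Note that step 2 is what catches the case where the
external script can no longer run: the file simply stops being refreshed.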

> Checking a file vs. running an executable has a few advantages. It's
> simpler to implement: no need to fork() and run the script. Also,
> the node health script or executable could potentially be resource
> intensive, and as it's implemented now, the running of that script is
> not synchronized to a clock; it's run at intervals offset from the time
> the mom started running, and we've seen where having unsynchronized
> and moderately resource-intensive processes running on compute nodes
> that are running a job across many nodes can drastically hurt the
> performance of the parallel job. By reading a file, the health script
> or application can be run from cron, which would minimize the program
> interrupting large parallel jobs.
You are still running a script, whether under the control of pbs_mom or
from cron. Proper care in your health checks is always advised
(avoiding resource-intensive calls where possible, etc.), but you are
still consuming resources regardless of what launches the script.
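
To make the timing point in the quoted text concrete, here is a small
Python sketch of the two scheduling models being compared: runs spaced from
each mom's start time versus runs aligned to wall-clock boundaries, as cron
would do. It is a simplified model under the assumptions described in the
quote, not pbs_mom's actual scheduling code, and the function names are
hypothetical.

```python
def next_run_offset(start, now, interval):
    """Next run time when runs are spaced at fixed intervals counted
    from the moment the daemon started."""
    elapsed = now - start
    return start + (elapsed // interval + 1) * interval


def next_run_aligned(now, interval):
    """Next run time when runs fall on wall-clock multiples of the
    interval, so every node with a synchronized clock agrees."""
    return (now // interval + 1) * interval


if __name__ == "__main__":
    interval = 300
    now = 2000
    # Two nodes whose daemons started 14 seconds apart.
    for start in (1003, 1017):
        print(start, next_run_offset(start, now, interval),
              next_run_aligned(now, interval))
```

With the offset model the two nodes run the check 14 seconds apart; with
the aligned model they run it at the same instant. Either way the script's
cost is paid on every node.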

> Comments/suggestions?
