[torquedev] suggestion on changing the mom's handling of $node_check_script and $down_on_error

Michael Barnes Michael.Barnes at jlab.org
Thu Oct 23 08:56:44 MDT 2008


Torque developers,

I like the addition of $node_check_script and $down_on_error to the
pbs_mom, but I think it's more complicated than it needs to be, and not
as robust as it could be.

For example, I have this in place on a few clusters that I manage, but
more than once I've seen a compute node with a failed disk where,
because the disk had failed, the $node_check_script could not run. That
failure was silently ignored, and jobs then ran through that node as
fast as they could, draining a large number of users' jobs.

An alternative model would be to get rid of $node_check_script and
replace it with $node_check_file. Instead of reading the script's
output, the mom would do the following:

1) stat() the file. If $down_on_error is set and accessing the file
fails, set the node down with the message "Can't access
$node_check_file".

2) If the file is too old (the age limit should probably be another
configuration option), emit an error and set the node down if
appropriate.

3) If we got past 1 and 2, read the file. If the file says "ERROR
whatever the error is", echo that error and set the node down if
appropriate.
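
To make the three steps concrete, here is a rough sketch in Python (not
the mom's actual C code). The function name, the max_age_seconds
parameter, and the "too old" message wording are all made up for
illustration; only the "Can't access" message and the leading "ERROR"
convention come from the steps above.

```python
import os
import time


def check_node_file(path, max_age_seconds, down_on_error=True, now=None):
    """Return (mark_down, message) for a hypothetical $node_check_file.

    mark_down is True only when a problem was found and down_on_error is set.
    """
    now = time.time() if now is None else now

    # Step 1: stat() the file; failing to access it is itself an error.
    try:
        st = os.stat(path)
    except OSError:
        return down_on_error, "Can't access %s" % path

    # Step 2: a stale file means the health check has stopped updating it.
    age = now - st.st_mtime
    if age > max_age_seconds:
        return down_on_error, "%s is too old (%d seconds)" % (path, age)

    # Step 3: read the file and look for a leading "ERROR".
    try:
        with open(path) as f:
            first_line = f.readline().strip()
    except OSError:
        return down_on_error, "Can't access %s" % path
    if first_line.startswith("ERROR"):
        return down_on_error, first_line

    return False, ""
```

The point of the ordering is that a node whose disk has failed gets
caught at step 1 or step 2 instead of being silently treated as healthy.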


Checking a file instead of running an executable has a few advantages.
It's simpler to implement, since there is no need to fork() and run the
script. Also, the node health script or executable could potentially be
resource intensive, and as it's implemented now, the running of that
script is not synchronized to a clock; it's run at intervals offset from
the time the mom started running. We've seen that unsynchronized,
moderately resource-intensive processes on compute nodes that are
running a job across many nodes can drastically hurt the performance of
the parallel job. If the mom reads a file, the health script or
application can be run from cron, which would minimize the program's
interruption of large parallel jobs.
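
For the cron side, something like the following Python sketch could
write the file. The function name and the "OK" line for a healthy node
are my own assumptions; only the "ERROR ..." line is part of the
proposal. It writes to a temporary file and renames it into place, so
the mom would never read a half-written file.

```python
import os
import tempfile


def write_health_file(path, error=None):
    """Atomically write a status file for a hypothetical $node_check_file."""
    # "OK" for a healthy node is an assumed convention, not part of the proposal.
    text = "ERROR %s\n" % error if error else "OK\n"
    directory = os.path.dirname(path) or "."
    # Create the temporary file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp, path)  # atomic rename over the old file
    except OSError:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If the disk has failed and this write cannot succeed, the old file
simply ages until it trips the "too old" check, which is the failure
mode the current script approach misses.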

Comments/suggestions?

-mb

-- 
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------