[torquedev] suggestion on changing the mom's handling of
$node_check_script and $down_on_error
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Thu Oct 23 10:19:23 MDT 2008
On Thu, 23 Oct 2008, Jerry Smith wrote:
> We try to avoid this by having our $PBS_HOME/$PBS_BIN directories
> NFS mounted, so if the script is "unavailable", the pbs_mom is too,
> and it could never start in the first place.
Not necessarily: $PBS_HOME may become unavailable after pbs_mom has
started. Also, if you configure pbs_mom to lock itself in memory, it
doesn't need any pages from storage and can survive the disappearance
of its binary (that's the theory anyway ;-))
However, to add to the OP's failure scenarios: if memory is tight
because the job uses all available memory, the health script will fail
to start. Reading a file which is produced by another memory-locked
daemon (which doesn't itself try to fork/exec something else) will
still work in that case.
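
As a rough illustration of the "read a file instead of fork/exec"
idea, here is a minimal Python sketch. It is not pbs_mom code; the
function name read_health_status and the max_bytes limit are
hypothetical. It only uses open/read/close on an existing file, so it
never has to start a new process.

```python
import os


def read_health_status(path, max_bytes=4096):
    """Return the text of a status file, or None if it cannot be read.

    Hypothetical helper: no fork/exec is involved, only open/read/close.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return None  # file missing or storage unavailable
    try:
        data = os.read(fd, max_bytes)
    except OSError:
        return None  # open succeeded but the read failed
    finally:
        os.close(fd)
    return data.decode("ascii", errors="replace").strip()
```

The caller would treat None as "status unknown" and decide what to do
with the node based on its own policy.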
Speaking of disk crashes: if the health script has been executed before
and has not yet been evicted from cache (because the checking interval
is short or the memory is largely unused and caching can use most of
it), it can still be executed even though the actual storage device is
no longer able to provide it.
>> 1) stat() the file, if $down_on_error is set and accessing the
>> file fails, set the node down with the message "Can't access
>> $node_check_file".
This assumes that you use that file as an indicator of the health of
disk access in particular. You could put that file in a memory-based FS
(tmpfs) and be able to survive such a disk failure - assuming that you
don't need the content of that file across reboots; this would allow
pbs_mom to know that the node is still usable if disk access is not
needed (of course, only if the OS, applications and libraries are not
on that disk...)
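
To make the proposed step 1 concrete, a minimal Python sketch of the
stat() check could look like the following. This is an assumption about
how it might behave, not the actual pbs_mom logic; check_access and the
"up"/"down" return values are made up for illustration.

```python
import os


def check_access(path, down_on_error):
    """Return (node_state, message) after trying to stat() the file.

    Hypothetical helper: the node is reported "down" only when the
    stat() fails and down_on_error is set.
    """
    try:
        os.stat(path)
    except OSError:
        message = "Can't access %s" % path
        if down_on_error:
            return ("down", message)
        return ("up", message)  # report the problem, keep the node up
    return ("up", "")
```

With the file on tmpfs, this check says nothing about the disk, which
is the trade-off described above.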
>> 2) If the file is too old (I guess old should be another configuration
>> option), emit an error and set the node down if appropriate.
Yes, that's a requirement for a monitoring system built of several
components that communicate through files.
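
The age check from step 2 could be sketched like this in Python. The
name check_age and the max_age_seconds parameter are hypothetical; they
stand in for whatever configuration option would be introduced.

```python
import os
import time


def check_age(path, max_age_seconds, now=None):
    """Return (is_fresh, message) based on the file's modification time.

    Hypothetical helper: raises OSError if the file cannot be accessed,
    so the caller can handle that case separately.
    """
    st = os.stat(path)
    if now is None:
        now = time.time()
    age = now - st.st_mtime
    if age > max_age_seconds:
        return (False, "health file is %d seconds old" % age)
    return (True, "")
```

The optional now argument is only there to make the function easy to
exercise with a fixed clock.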
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de