[torquedev] suggestion on changing the mom's handling of $node_check_script and $down_on_error

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Thu Oct 23 10:19:23 MDT 2008


On Thu, 23 Oct 2008, Jerry Smith wrote:

> We try to avoid this by having our $PBS_HOME/$PBS_BIN directories 
> NFS mounted, so if the script is "unavailable", the pbs_mom is too, 
> and it could never start in the first place.

Not necessarily: $PBS_HOME can become unavailable after pbs_mom has 
started. Also, if you configure pbs_mom to lock itself in memory, it 
doesn't need any pages from storage and can survive the disappearance 
of its binary (that's the theory anyway ;-))

However, to add to the OP's failure situations: if memory is tight 
because the job uses all available memory, the health script will fail 
to start. Reading a file produced by another memory-locked daemon 
(which doesn't itself try to fork/exec something else) will still 
work in that case.

Speaking of disk crashes: if the health script has been executed 
before and has not yet been evicted from the cache (because the 
checking interval is short, or because memory is largely unused and 
caching can use most of it), it can still be executed even though the 
actual storage device can no longer provide it.

>>  1) stat() the file, if $down_on_error is set and accessing the
>>  file fails, set the node down with the message "Can't access
>>  $node_check_file".

This assumes that you use that file as an indicator of the health of 
disk access in particular. You could instead put the file in a 
memory-based FS (tmpfs) and survive such a disk failure, assuming that 
you don't need the content of that file across reboots. This would let 
pbs_mom know that the node is still usable if disk access is not 
needed (of course, only if the OS, applications and libraries are not 
on that disk...)
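
As a rough illustration of step 1 quoted above, here is a minimal 
Python sketch (pbs_mom itself is written in C, and this is not TORQUE 
code). The function name check_file_access, its down_on_error argument 
and the (node_ok, message) return value are hypothetical; only the 
stat() call and the "Can't access" message come from the proposal.

```python
import os


def check_file_access(path, down_on_error=True):
    """Return (node_ok, message) after trying to stat() the health file.

    `path` stands in for the proposed $node_check_file; the return
    shape is an assumption made for this sketch.
    """
    try:
        os.stat(path)
    except OSError as exc:
        msg = "Can't access %s: %s" % (path, exc.strerror)
        # The node is only marked down when down_on_error is set;
        # otherwise the failure is just reported.
        return (not down_on_error, msg)
    return (True, "")
```

Pointing path at a file on tmpfs versus a file on local disk changes 
what a failure means: the first mostly tests that the writer daemon 
and the memory FS are alive, the second also tests disk access.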

>>  2) If the file is too old (I guess old should be another configuration
>>  option), emit an error and set the node down if appropriate.

Yes, that's a requirement for a monitoring system built of several 
components that communicate through files.
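
A minimal Python sketch of such an age check, under stated 
assumptions: the function name check_file_age, the max_age_seconds 
parameter (standing in for the "old" configuration option suggested in 
step 2) and the message texts are all hypothetical, and the file's 
modification time is assumed to be the right freshness indicator.

```python
import os
import time


def check_file_age(path, max_age_seconds, now=None):
    """Return (node_ok, message); the file is stale if its mtime is
    more than max_age_seconds before `now` (default: current time)."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except OSError as exc:
        # An inaccessible file is treated as a failure here.
        return (False, "Can't access %s: %s" % (path, exc.strerror))
    age = now - mtime
    if age > max_age_seconds:
        return (False, "%s is stale: %.0f s old, limit %.0f s"
                % (path, age, max_age_seconds))
    return (True, "")
```

The limit would normally be a small multiple of the interval at which 
the writer updates the file, so that one delayed update does not take 
the node down.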

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de

