[torqueusers] Re: pbs_mom caches last healthcheck script error ?
(Re: [Moabusers] Moab keeps on trying after pbs_mom rejects.)
Chris Samuel
csamuel at vpac.org
Mon Dec 4 16:25:04 MST 2006
On Tuesday 05 December 2006 10:06, Garrick Staples wrote:
Quick clarification - only Moab has the node marked down, the pbs_server
thinks it's free (but pbsnodes -a lists the message in the mom).
> Use 'momctl -C' instead.
Doesn't appear to do anything:
# momctl -d 1
[...]
MOM Message: ERROR myrinet card is not in 64bit mode
(use 'momctl -q clearmsg' to clear)
[...]
# momctl -C
mom localhost successfully cycled cycle forced
# momctl -d 1
[...]
MOM Message: ERROR myrinet card is not in 64bit mode
(use 'momctl -q clearmsg' to clear)
[...]
> Though it would have cleared by the time you read this.
Er, it's been like this for getting on for a week now, which is why we hit the lists
to see if we could track this down.
> The "error message" in MOM can come from multiple places, and is sent to
> pbs_server every update interval (45 seconds from your output).
OK..
> The health check script is one possible way to trigger an error message,
> but since it only runs every "node_check_interval" intervals, the
> script's output is cached. Every interval, the cached copy of the error
> message is copied into the error message buffer unless it is time to
> rerun the script.
This even persists across MOM restarts, which is what really
puzzles me. :-(
> 'momctl -q clearmsg' just clears the error message, not the status of
> the health check script.
Aha, OK.
> 'momctl -C' clears the counter for the health check and triggers a new
> interval.
The odd thing is that if I do the clearmsg, the message goes away for a while, but
if I do the clearmsg immediately followed by -C, the message is right back
and Moab keeps the node marked down (even though the pbs_server thinks it's
free).
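
For anyone following along, here is a minimal Python sketch of the caching
behaviour as Garrick describes it in the quoted text. It is a toy model, not
TORQUE source code: the class MomModel, its method names, and the rule that a
passing check leaves the message buffer untouched are all assumptions made for
illustration.

```python
class MomModel:
    """Toy model of the described behaviour; NOT taken from TORQUE source."""

    def __init__(self, health_check, node_check_interval):
        # health_check: hypothetical callable returning "" or "ERROR ..."
        self.health_check = health_check
        self.node_check_interval = node_check_interval
        # Start at the threshold so the script runs on the first interval.
        self.intervals_since_check = node_check_interval
        self.cached_result = ""   # last output of the health check script
        self.message = ""         # error message buffer sent to pbs_server

    def update_interval(self):
        """One update interval: rerun the script if due, else reuse the cache."""
        if self.intervals_since_check >= self.node_check_interval:
            self.cached_result = self.health_check()
            self.intervals_since_check = 0
        self.intervals_since_check += 1
        # Assumption: only a failing result is copied into the buffer.
        if self.cached_result.startswith("ERROR"):
            self.message = self.cached_result
        return self.message

    def clearmsg(self):
        """Models 'momctl -q clearmsg': clears the buffer, not the cache."""
        self.message = ""

    def cycle(self):
        """Models 'momctl -C': resets the counter and forces a new interval."""
        self.intervals_since_check = self.node_check_interval
        return self.update_interval()


if __name__ == "__main__":
    def failing_check():
        return "ERROR myrinet card is not in 64bit mode"

    mom = MomModel(failing_check, node_check_interval=3)
    mom.update_interval()          # script runs and fails
    mom.clearmsg()                 # buffer is empty for now
    print(repr(mom.message))
    print(mom.update_interval())   # cached failure is copied back
    mom.clearmsg()
    print(mom.cycle())             # forced rerun fails again immediately
```

In this model, clearmsg only empties the buffer, so the cached failure is
copied back at the next update, and a forced cycle reruns the script and
restores the message at once if the check still fails. It does not account for
the message surviving a MOM restart.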
cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia