[torqueusers] Re: pbs_mom caches last healthcheck script error ?
(Re: [Moabusers] Moab keeps on trying after pbs_mom rejects.)
Garrick Staples
garrick at usc.edu
Mon Dec 4 16:28:25 MST 2006
On Tue, Dec 05, 2006 at 10:25:04AM +1100, Chris Samuel alleged:
> On Tuesday 05 December 2006 10:06, Garrick Staples wrote:
>
> Quick clarification - only Moab has the node marked down, the pbs_server
> thinks it's free (but pbsnodes -a lists the message in the mom).
>
> > Use 'momctl -C' instead.
>
> Doesn't appear to do anything:
>
> # momctl -d 1
> [...]
> MOM Message: ERROR myrinet card is not in 64bit mode
> (use 'momctl -q clearmsg' to clear)
> [...]
> # momctl -C
> mom localhost successfully cycled cycle forced
> # momctl -d 1
> [...]
> MOM Message: ERROR myrinet card is not in 64bit mode
> (use 'momctl -q clearmsg' to clear)
> [...]
>
> > Though it would have cleared by the time you read this.
>
> Er, it's been like this for getting on a week now, that's why we hit the lists
> to see if we could track this down.
>
> > The "error message" in MOM can come from multiple places, and is sent to
> > pbs_server every update interval (45 seconds from your output).
>
> OK..
>
> > The health check script is one possible way to trigger an error message,
> > but since it only runs every "node_check_interval" intervals, the
> > script's output is cached.  Every interval, the cached copy of the error
> > message is copied into the error message buffer unless it is time to
> > rerun the script.
>
> This is even persisting across MOM restarts as well, which is what really
> puzzles me.. :-(
>
> > 'momctl -q clearmsg' just clears the error message, not the status of
> > the health check script.
>
> Aha, OK.
>
> > 'momctl -C' clears the counter for the health check and triggers a new
> > interval.
>
> Odd thing is if I do the clearmsg then the message goes away for a while, but
> if I do the clearmsg immediately followed by -C the message is right back,
> and Moab keeps the node marked down (even though the pbs_server thinks it's
> free).
Then your health check script is returning the error.
As far as Moab is concerned, the important part is what pbs_server says.
Does 'pbsnodes -a $name' show "message=ERROR ..."?
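
To make the interaction between the cached health check output, 'clearmsg'
and '-C' concrete, here is a toy Python model of the behavior described
above. It is only a sketch, not pbs_mom source: the class MomModel, its
method names, and the rule that script output beginning with "ERROR" counts
as a failure are all assumptions made for illustration.

```python
class MomModel:
    """Toy model of the caching described above; NOT actual pbs_mom code."""

    def __init__(self, health_check, node_check_interval):
        # health_check: hypothetical callable standing in for the script
        self.health_check = health_check
        self.node_check_interval = node_check_interval
        # Start "due" so the script runs on the first update interval.
        self.intervals_since_check = node_check_interval
        self.cached_output = ""   # last output of the health check script
        self.message = ""         # error message buffer sent to pbs_server

    def update(self):
        """One status update interval."""
        if self.intervals_since_check >= self.node_check_interval:
            # Time to rerun the script and refresh the cache.
            self.cached_output = self.health_check()
            self.intervals_since_check = 0
        self.intervals_since_check += 1
        # Assumption: only output starting with "ERROR" is an error message.
        if self.cached_output.startswith("ERROR"):
            self.message = self.cached_output
        return self.message

    def clear_message(self):
        """Models 'momctl -q clearmsg': clears the buffer, not the cache."""
        self.message = ""

    def cycle(self):
        """Models 'momctl -C': resets the counter and forces a new interval."""
        self.intervals_since_check = self.node_check_interval
        return self.update()


def failing_check():
    # Stand-in for a health check script that keeps failing.
    return "ERROR myrinet card is not in 64bit mode"


mom = MomModel(failing_check, node_check_interval=10)
mom.update()          # script runs, error is cached and reported
mom.clear_message()   # buffer is empty until the next update interval
mom.cycle()           # script reruns, still fails, message is back
print(mom.message)
```

In this model, clearing the message only helps until the next update
interval copies the cached error back, and forcing a cycle reruns the script
at once. So if the message returns immediately after '-C', the script itself
is still reporting the error.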
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California