[torqueusers] Re: pbs_mom caches last healthcheck script error ? (Re: [Moabusers] Moab keeps on trying after pbs_mom rejects.)

Garrick Staples garrick at usc.edu
Mon Dec 4 16:28:25 MST 2006


On Tue, Dec 05, 2006 at 10:25:04AM +1100, Chris Samuel alleged:
> On Tuesday 05 December 2006 10:06, Garrick Staples wrote:
> 
> Quick clarification - only Moab has the node marked down, the pbs_server 
> thinks it's free (but pbsnodes -a lists the message in the mom).
> 
> > Use 'momctl -C' instead.
> 
> Doesn't appear to do anything:
> 
> # momctl -d 1
> [...]
> MOM Message:            ERROR myrinet card is not in 64bit mode
>  (use 'momctl -q clearmsg' to clear)
> [...]
> # momctl -C
> mom localhost successfully cycled cycle forced
> # momctl -d 1
> [...]
> MOM Message:            ERROR myrinet card is not in 64bit mode
>  (use 'momctl -q clearmsg' to clear)
> [...]
> 
> > Though it would have cleared by the time you read this.
> 
> Er, it's been like this for getting on a week now, that's why we hit the lists 
> to see if we could track this down.
> 
> > The "error message" in MOM can come from multiple places, and is sent to
> > pbs_server every update interval (45 seconds from your output).
> 
> OK..
> 
> > The health check script is one possible way to trigger an error message,
> > but since it only runs every "node_check_interval" intervals, the
> > script's output is cached. Every interval, the cached copy of the error
> > message is copied into the error message buffer unless it is time to
> > rerun the script.
> 
> This is even persisting across MOM restarts as well, which is what really 
> puzzles me.. :-(
> 
> > 'momctl -q clearmsg' just clears the error message, not the status of
> > the health check script.
> 
> Aha, OK.
> 
> > 'momctl -C' clears the counter for the health check and triggers a new
> > interval.
> 
> Odd thing is if I do the clearmsg then the message goes away for a while, but 
> if I do the clearmsg immediately followed by -C the message is right back, 
> and Moab keeps the node marked down (even though the pbs_server thinks it's 
> free).

Then your health check script is returning the error.

As far as Moab is concerned, the important part is what pbs_server says.
Does 'pbsnodes -a $name' show "message=ERROR ..."?
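
To illustrate the caching behaviour described above, here is a minimal
Python sketch. It is a toy model only, not the pbs_mom source: the class
MomSketch, its method names, and the rule that an empty script result leaves
the buffer untouched are all assumptions made for illustration.

```python
class MomSketch:
    """Toy model of the error-message caching described above (hypothetical)."""

    def __init__(self, health_check, node_check_interval):
        self.health_check = health_check        # callable: returns error text or ""
        self.node_check_interval = node_check_interval
        self.counter = 0                        # intervals since the script last ran
        self.cached = ""                        # cached output of the last script run
        self.message = ""                       # buffer reported to pbs_server

    def interval(self):
        """One update interval: rerun the script if due, else reuse the cache."""
        if self.counter == 0:
            self.cached = self.health_check()
        self.counter = (self.counter + 1) % self.node_check_interval
        if self.cached:
            # Assumption: an empty result leaves the buffer as it was.
            self.message = self.cached
        return self.message

    def clearmsg(self):
        """Models 'momctl -q clearmsg': clears the buffer, not the cache."""
        self.message = ""

    def cycle(self):
        """Models 'momctl -C': resets the counter and forces a new interval."""
        self.counter = 0
        return self.interval()


def failing_check():
    # Stand-in for a health check script that keeps reporting the problem.
    return "ERROR myrinet card is not in 64bit mode"


mom = MomSketch(failing_check, node_check_interval=10)
mom.interval()
mom.clearmsg()
after_clear = mom.message    # empty: only the buffer was cleared
after_cycle = mom.cycle()    # the script ran again and the error is back
```

In this model, clearmsg followed by a forced cycle brings the message straight
back whenever the script itself still fails, which matches the symptom
reported above.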

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California