[torqueusers] Re: pbs_mom caches last healthcheck script error ? (Re: [Moabusers] Moab keeps on trying after pbs_mom rejects.)

Chris Samuel csamuel at vpac.org
Mon Dec 4 16:25:04 MST 2006

On Tuesday 05 December 2006 10:06, Garrick Staples wrote:

Quick clarification - only Moab has the node marked down, the pbs_server 
thinks it's free (but pbsnodes -a lists the message in the mom).

> Use 'momctl -C' instead.

Doesn't appear to do anything:

# momctl -d 1
MOM Message:            ERROR myrinet card is not in 64bit mode
 (use 'momctl -q clearmsg' to clear)
# momctl -C
mom localhost successfully cycled cycle forced
# momctl -d 1
MOM Message:            ERROR myrinet card is not in 64bit mode
 (use 'momctl -q clearmsg' to clear)
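
For anyone wanting to repeat this check on several nodes, here is a rough Python sketch that pulls the "MOM Message:" line out of `momctl -d 1` output. It assumes the line looks exactly like the excerpt above. The function names are made up for illustration and are not part of TORQUE.

```python
import re
import subprocess
from typing import Optional

# Matches the "MOM Message:" line as it appears in the excerpt above.
# [ \t]* (not \s*) keeps the match from spilling onto the next line.
_MOM_MESSAGE = re.compile(r"^MOM Message:[ \t]*(.*)$", re.MULTILINE)


def parse_mom_message(diag_output: str) -> Optional[str]:
    """Return the text after 'MOM Message:', or None if there is no message."""
    match = _MOM_MESSAGE.search(diag_output)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


def read_mom_message() -> Optional[str]:
    """Run 'momctl -d 1' on the local node and extract the message (hypothetical helper)."""
    result = subprocess.run(
        ["momctl", "-d", "1"], capture_output=True, text=True, check=True
    )
    return parse_mom_message(result.stdout)
```

Calling `read_mom_message()` before and after `momctl -C` gives a quick way to confirm whether the message really changed.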

> Though it would have cleared by the time you read this.

Er, it's been like this for getting on a week now, that's why we hit the lists 
to see if we could track this down.

> The "error message" in MOM can come from multiple places, and is sent to
> pbs_server every update interval (45 seconds from your output).


> The health check script is one possible way to trigger an error message,
> but since it only runs every "node_check_interval" intervals, the
> script's output is cached.  Every interval, the cached copy of the error
> message is copied into the error message buffer unless it is time to
> rerun the script.

This is even persisting across MOM restarts as well, which is what really 
puzzles me.. :-(

> 'momctl -q clearmsg' just clears the error message, not the status of
> the health check script.

Aha, OK.

> 'momctl -C' clears the counter for the health check and triggers a new
> interval.
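
A toy Python model of the behaviour described in the quoted text, to make the interaction between the cached health check result, 'clearmsg' and '-C' easier to follow. This is only a sketch of that description, not TORQUE's actual code: the class, method and attribute names are all hypothetical, and treating "clears the counter" as "the script is due on the next interval" is an assumption.

```python
from typing import Callable, Optional


class MomModel:
    """Toy model of the quoted description; names are illustrative only."""

    def __init__(self, check_script: Callable[[], Optional[str]],
                 node_check_interval: int) -> None:
        self.check_script = check_script            # returns an error string, or None if healthy
        self.node_check_interval = node_check_interval
        self.intervals_until_check = 0              # 0 means the script is due now
        self.cached_check_error: Optional[str] = None
        self.message: Optional[str] = None          # the message buffer reported to pbs_server

    def update_interval(self) -> None:
        """One status update interval."""
        if self.intervals_until_check <= 0:
            # Time to rerun the script and refresh the cached result.
            self.cached_check_error = self.check_script()
            self.intervals_until_check = self.node_check_interval
        self.intervals_until_check -= 1
        # The cached result is copied into the message buffer every interval.
        if self.cached_check_error is not None:
            self.message = self.cached_check_error

    def clearmsg(self) -> None:
        """Model of 'momctl -q clearmsg': clears the buffer, not the cache."""
        self.message = None

    def cycle(self) -> None:
        """Model of 'momctl -C': reset the counter and force a new interval."""
        self.intervals_until_check = 0
        self.update_interval()


if __name__ == "__main__":
    mom = MomModel(lambda: "ERROR myrinet card is not in 64bit mode",
                   node_check_interval=3)
    mom.update_interval()
    mom.clearmsg()
    print("after clearmsg:", mom.message)
    mom.cycle()
    print("after -C:", mom.message)
```

In this model, a script that keeps failing puts the message straight back after a forced cycle, and the cached copy restores it on the next ordinary interval even without one. The model has no state that survives a restart, so it does not account for the message persisting across MOM restarts.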

Odd thing is if I do the clearmsg then the message goes away for a while, but 
if I do the clearmsg immediately followed by -C the message is right back, 
and Moab keeps the node marked down (even though the pbs_server thinks it's 
free).

 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
