[torqueusers] Re: pbs_mom caches last healthcheck script error ? (Re: [Moabusers] Moab keeps on trying after pbs_mom rejects.)

Garrick Staples garrick at clusterresources.com
Mon Dec 4 16:06:07 MST 2006


On Tue, Dec 05, 2006 at 09:42:34AM +1100, Chris Samuel alleged:
> On Tuesday 05 December 2006 05:12, wightman wrote:
> 
> > When the message no longer returns an ERROR, then Moab correctly places
> > the node back into the scheduling queue.
> >
> > What are you seeing on your cluster?
> 
> It's actually looking like a Torque problem, sorry!
> 
> The script returns nothing (as it should) but the MOM seems to be remembering 
> the last error it saw, and it even returns if we clear it by hand.
> 
> For instance, a node had a Myrinet card replaced and they forgot to set the 
> switch on the card to 64-bit mode.  Our script picked it up correctly and 
> placed the node offline.  Then we fixed the card, brought the node back up 
> and the script saw everything was fine but the MOM was still down in Moab 
> with the old error.
> 
> We noticed the mom still had the message and assumed we'd have to clear it by 
> hand, thus:
> 
> # momctl -d 1
> 
> [...]
> Server Update Interval: 45 seconds
> MOM Message:            ERROR myrinet card is not in 64bit mode
>  (use 'momctl -q clearmsg' to clear)
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> [...]
> # momctl -q clearmsg
>    localhost:     clearmsg = 'messages cleared'
> 
> Message went away..
> 
> # momctl -d 1
> [...]
> Server Update Interval: 45 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> [...]
> 
> Then, within a minute it's back, even though the script isn't triggering it:
> 
> # momctl -d 1
> [...]
> Server Update Interval: 45 seconds
> MOM Message:            ERROR myrinet card is not in 64bit mode
>  (use 'momctl -q clearmsg' to clear)
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> 
> But the script says everything is OK!
> 
> # /usr/local/sbin/moab-check-health.sh
> #
> 
> Brett has this cluster running Torque 2.2.0-snap.200610191709.

Use 'momctl -C' instead, though the message will have cleared on its
own by the time you read this.

The "error message" in MOM can come from multiple places, and is sent to
pbs_server every update interval (45 seconds from your output).

The health check script is one possible source of an error message,
but since it only runs every "node_check_interval" update intervals,
the script's output is cached.  On every update interval, the cached
copy of the error message is copied into the error message buffer
unless it is time to rerun the script.

'momctl -q clearmsg' just clears the error message, not the status of
the health check script.

'momctl -C' clears the counter for the health check and triggers a new
interval.
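
As a rough illustration of the behaviour described above, here is a toy
model in Python.  It is not pbs_mom's actual source: the class
MomModel, its methods, and the countdown bookkeeping are hypothetical
names and assumptions made for the sketch, and it treats the health
check as the only source of messages.  It only shows why the message
reappears after 'clearmsg' but not after '-C'.

```python
class MomModel:
    """Toy model of the caching described above (not Torque code)."""

    def __init__(self, check_script, node_check_interval):
        self.check_script = check_script        # callable: "" or "ERROR ..."
        self.node_check_interval = node_check_interval
        self.countdown = 0       # update intervals left until the next rerun
        self.cached = ""         # cached output of the last script run
        self.message = ""        # buffer that is reported to pbs_server

    def update_interval(self):
        """One server update interval (45 seconds in the output above)."""
        if self.countdown <= 0:
            # Time to rerun the script: refresh the cache and the buffer.
            self.cached = self.check_script()
            self.message = self.cached
            self.countdown = self.node_check_interval
        elif self.cached:
            # Not time yet: the cached error is copied back into the buffer.
            self.message = self.cached
        self.countdown -= 1
        return self.message

    def clear_message(self):
        """Models 'momctl -q clearmsg': clears the buffer, not the cache."""
        self.message = ""

    def clear_check(self):
        """Models 'momctl -C' (assumed): reset the counter, start an interval."""
        self.countdown = 0
        self.update_interval()


# Scenario from this thread: the card is broken, then fixed.
state = {"card_ok": False}

def check():
    return "" if state["card_ok"] else "ERROR myrinet card is not in 64bit mode"

mom = MomModel(check, node_check_interval=10)
first = mom.update_interval()          # script runs and reports the error
state["card_ok"] = True                # card fixed, script is now silent

mom.clear_message()
after_clearmsg = mom.message           # buffer is empty for the moment
reappeared = mom.update_interval()     # cached error is copied back in

mom.clear_check()
after_clear_check = mom.message        # script was rerun, so nothing to report
```

In this model the stale error persists for up to node_check_interval
update intervals after the fix, which matches the message coming back
"within a minute" after 'clearmsg'.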
