[torqueusers] pbs_mom caches last healthcheck script error ? (Re: [Moabusers] Moab keeps on trying after pbs_mom rejects.)

Chris Samuel csamuel at vpac.org
Mon Dec 4 15:42:34 MST 2006


On Tuesday 05 December 2006 05:12, wightman wrote:

> When the message no longer returns an ERROR, then Moab correctly places
> the node back into the scheduling queue.
>
> What are you seeing on your cluster?

It's actually looking like a Torque problem, sorry!

The script returns nothing (as it should), but the MOM seems to be remembering
the last error it saw, and the error even comes back after we clear it by hand.

For instance, a node had a Myrinet card replaced and they forgot to set the 
switch on the card to 64-bit mode.  Our script picked it up correctly and 
placed the node offline.  Then we fixed the card, brought the node back up 
and the script saw everything was fine but the MOM was still down in Moab 
with the old error.

We noticed the MOM still had the message and assumed we'd have to clear it by
hand, thus:

# momctl -d 1

[...]
Server Update Interval: 45 seconds
MOM Message:            ERROR myrinet card is not in 64bit mode
 (use 'momctl -q clearmsg' to clear)
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
[...]
# momctl -q clearmsg
   localhost:     clearmsg = 'messages cleared'

The message went away:

# momctl -d 1
[...]
Server Update Interval: 45 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
[...]

Then, within a minute it's back, even though the script isn't triggering it:

# momctl -d 1
[...]
Server Update Interval: 45 seconds
MOM Message:            ERROR myrinet card is not in 64bit mode
 (use 'momctl -q clearmsg' to clear)
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)

But the script says everything is OK!

# /usr/local/sbin/moab-check-health.sh
#

Brett has this cluster running Torque 2.2.0-snap.200610191709.

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
