[torqueusers] Node disk broken but NHC can't offline Torque node

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Mon Oct 7 03:59:03 MDT 2013


Hi Torque users,

We're extremely satisfied with the Node Health Check (NHC, 
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check) system running 
on our Torque compute nodes from crontab. Quite a few node error 
conditions get detected by NHC, and Torque will offline the node when 
this happens.

However, today we had an interesting node hard disk failure which 
unfortunately did not lead to the node being offlined by NHC before a 
lot of job crashes had caused a series of user complaints.

This scenario is due to a partly failing disk which causes the kernel to 
remount the disk in read-only mode. The node (named a071) is still 
running, and I can do "ssh a071 dmesg" to get a lot of errors like:

Buffer I/O error on device dm-0, logical block 7092949
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 7092950
lost page write due to I/O error on dm-0
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 09 8c 07 10 00 00 08 00

If I try to log in to the node I get this tell-tale error:

# ssh a071
-bash: /root/.bash_profile: Input/output error
Connection to a071 closed.

This usually means that the disk is broken.
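
The read-only remount itself can be tested for directly. A minimal 
sketch, assuming the kernel reports the remount in /proc/mounts; 
is_mounted_ro is a hypothetical helper, not an existing NHC check:

```shell
#!/bin/sh
# is_mounted_ro MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper (not part of NHC).
# Exit status: 0 if MOUNTPOINT is mounted read-only,
#              1 if it is mounted but not read-only,
#              2 if it is not listed at all.
# MOUNTS_FILE defaults to /proc/mounts, whose fields are:
#   device mountpoint fstype options dump pass
is_mounted_ro() {
    awk -v m="$1" '
        $2 == m { opts = $4 }          # the last matching entry wins
        END {
            if (opts == "") exit 2     # mount point not found
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") exit 0
            exit 1
        }
    ' "${2:-/proc/mounts}"
}
```

For example, "is_mounted_ro / && echo 'root is read-only'" could be run 
on a suspect node. Nothing here writes to the local disk.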

I can run NHC on the node with "ssh a071 nhc", giving this output:

/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 313: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
/usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
ERROR Health check failed:  check_fs_mount:  /scratch mount options incorrect
/usr/sbin/nhc: line 34: /var/log/nhc.log: Read-only file system

The problem is that NHC can't send mail with Sendmail (since the disk is 
read-only), and it is apparently unable to offline the node because line 
34 of /usr/sbin/nhc fails when $LOGFILE is unwritable:

         eval '$OFFLINE_NODE "$HOSTNAME" "$*" </dev/null >/dev/null' $LOGFILE '2>&1 &'

@Michael: Do you think that NHC could be made resilient to the case 
where $LOGFILE resides on a read-only filesystem?
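
One possible approach is to probe the log file before it is used and 
fall back to /dev/null when it cannot be written. A rough sketch only, 
assuming (as the eval line above suggests) that $LOGFILE holds the 
redirection itself, e.g. ">>/var/log/nhc.log"; pick_logfile is a 
hypothetical name, not an NHC function:

```shell
#!/bin/sh
# pick_logfile REDIRECTION
# Hypothetical helper (not part of NHC). Assumes REDIRECTION is a string
# such as ">>/var/log/nhc.log". Prints it unchanged if the target can be
# opened for appending, and ">>/dev/null" otherwise.
pick_logfile() {
    target="${1#'>>'}"         # strip a leading ">>" ...
    target="${target#'>'}"     # ... or a single leading ">"
    # Probe in a subshell so a failed redirection cannot abort the caller.
    if ( : >>"$target" ) 2>/dev/null; then
        printf '%s\n' "$1"
    else
        printf '%s\n' '>>/dev/null'
    fi
}
```

With something like LOGFILE=$(pick_logfile "$LOGFILE") early in the 
script, the offline command would still be started on a read-only 
filesystem, at the cost of losing the log output on that node.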

Best regards,
Ole

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H.Nielsen at fysik.dtu.dk
Homepage: http://www.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620 / Fax: (+45) 4593 2399

