[torqueusers] Node disk broken but NHC can't offline Torque node

Roman Baranowski roman at chem.ubc.ca
Mon Oct 7 04:35:16 MDT 2013



 	Dear All,

For many years our local health check script has done the following:

#
# 2d)  - read/write on local /
#
         # touch fails if / is read-only or otherwise unwritable
         if ! touch /write.test > /dev/null 2>&1
         then
           echo "ERROR cannot write to local root FS"
           exit 1
         fi

A similar test is performed on all other 'local' partitions (/tmp, etc.).
It works like a charm.
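
As a rough sketch of how the same test could be repeated for several
partitions (the function name check_writable and the probe file name are
made up for illustration; this is not our actual script):

```shell
# check_writable DIR
# Hypothetical helper: try to create/update a probe file in DIR.
# Prints a line starting with ERROR and returns 1 if the write fails.
check_writable() {
    dir=$1
    # ${dir%/} strips a trailing slash so that "/" yields "/write.test"
    probe="${dir%/}/write.test"
    if ! touch "$probe" > /dev/null 2>&1
    then
        echo "ERROR cannot write to local FS $dir"
        return 1
    fi
    # Clean up the probe file; ignore errors here
    rm -f "$probe"
    return 0
}
```

In the health check script this could then be used as, for example:
for d in / /tmp; do check_writable "$d" || exit 1; done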

The mom config:
$node_check_script      /var/spool/torque/mom_priv/check_node.sh
$node_check_interval    5
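
For the read-only remount case described below, the script configured
above could in principle also look at the mount table directly instead of
(or in addition to) trying to write. The following is only a sketch under
the assumption of a Linux node with /proc/mounts; the function name
is_mounted_ro is hypothetical and not part of our script or of NHC:

```shell
# is_mounted_ro MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: returns 0 if MOUNTPOINT is listed with the "ro"
# option in the mount table, 1 otherwise (including "not mounted").
# MOUNTS_FILE defaults to /proc/mounts (Linux); field 2 is the mount
# point and field 4 the comma-separated mount options.
is_mounted_ro() {
    mp=$1
    mounts=${2:-/proc/mounts}
    # The last matching entry wins, since later mounts shadow earlier ones
    awk -v mp="$mp" '
        $2 == mp { opts = $4 }
        END { exit (opts ~ /(^|,)ro(,|$)/) ? 0 : 1 }
    ' "$mounts"
}
```

A check could then read:
if is_mounted_ro /; then echo "ERROR / is mounted read-only"; exit 1; fi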


 	All the best
 	Roman


On Mon, 7 Oct 2013, Ole Holm Nielsen wrote:

> Hi Torque users,
>
> We're extremely satisfied with the Node Health Check (NHC,
> http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check) system running
> on our Torque compute nodes from crontab. Quite a few node error
> conditions get detected by NHC, and Torque will offline the node when
> this happens.
>
> However, today we had an interesting node hard disk failure which
> unfortunately didn't cause NHC to offline the node until a lot of job
> crashes caused a series of user complaints.
>
> This scenario is due to a partly failing disk which causes the kernel to
> remount the disk in read-only mode. The node (named a071) is still
> running, and I can do "ssh a071 dmesg" to get a lot of errors like:
>
> Buffer I/O error on device dm-0, logical block 7092949
> lost page write due to I/O error on dm-0
> Buffer I/O error on device dm-0, logical block 7092950
> lost page write due to I/O error on dm-0
> sd 0:0:0:0: [sda] Unhandled error code
> sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 09 8c 07 10 00 00 08 00
>
> If I try to log in to the node I get this tell-tale error:
> # ssh a071
> -bash: /root/.bash_profile: Input/output error
> Connection to a071 closed.
> This usually means that the disk is broken.
>
> I can run NHC on the node with "ssh a071 nhc", giving this output:
>
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 313: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> ERROR Health check failed:  check_fs_mount:  /scratch mount options incorrect
> /usr/sbin/nhc: line 34: /var/log/nhc.log: Read-only file system
>
> The problem is that NHC can't send mail with Sendmail (since disk is
> read-only), and apparently is unable to offline the node because line 34
> in /usr/sbin/nhc fails since $LOGFILE is unwriteable:
>
>         eval '$OFFLINE_NODE "$HOSTNAME" "$*" </dev/null >/dev/null' $LOGFILE '2>&1 &'
>
> @Michael: Do you think that NHC could be made resilient to the case
> where $LOGFILE resides on a read-only filesystem?
>
> Best regards,
> Ole
>
> -- 
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark,
> Building 307, DK-2800 Kongens Lyngby, Denmark
> E-mail: Ole.H.Nielsen at fysik.dtu.dk
> Homepage: http://www.fysik.dtu.dk/~ohnielse/
> Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620 / Fax: (+45) 4593 2399

