[torqueusers] Node disk broken but NHC can't offline Torque node

Michael Jennings mej at lbl.gov
Mon Oct 7 17:20:31 MDT 2013


On Monday, 07 October 2013, at 11:59:03 (+0200),
Ole Holm Nielsen wrote:

> We're extremely satisfied with the Node Health Check (NHC,
> http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check) system
> running on our Torque compute nodes from crontab. Quite a few node
> error conditions get detected by NHC, and Torque will offline the
> node when this happens.

I realize that link can make for a lot of typing, so in case you want
to link to it in the future, this one's easier to type (and to
remember!):

http://go.lbl.gov/nhc

Not really relevant, but hopefully helpful. :-)

> However, today we had an interesting node hard disk failure which
> unfortunately didn't cause NHC to offline the node until a lot of
> job crashes caused a series of user complaints.
> 
> This scenario is due to a partly failing disk which causes the
> kernel to remount the disk in read-only mode. The node (named a071)
> is still running, and I can do "ssh a071 dmesg" to get a lot of
> errors like:
> 
> Buffer I/O error on device dm-0, logical block 7092949
> lost page write due to I/O error on dm-0
> Buffer I/O error on device dm-0, logical block 7092950
> lost page write due to I/O error on dm-0
> sd 0:0:0:0: [sda] Unhandled error code
> sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 09 8c 07 10 00 00 08 00
> 
> If I try to log in to the node I get this tell-tale error:
> # ssh a071
> -bash: /root/.bash_profile: Input/output error
> Connection to a071 closed.
> This usually means that the disk is broken.
> 
> I can run NHC on the node with "ssh a071 nhc", giving this output:
> 
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 313: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> /usr/sbin/nhc: line 62: /var/log/nhc.log: Read-only file system
> ERROR Health check failed:  check_fs_mount:  /scratch mount options incorrect
> /usr/sbin/nhc: line 34: /var/log/nhc.log: Read-only file system
>
> The problem is that NHC can't send mail with Sendmail (since disk is
> read-only), and apparently is unable to offline the node because
> line 34 in /usr/sbin/nhc fails since $LOGFILE is unwriteable:
> 
>         eval '$OFFLINE_NODE "$HOSTNAME" "$*" </dev/null >/dev/null' $LOGFILE '2>&1 &'
> 
> @Michael: Do you think that NHC could be made resilient to the case
> where $LOGFILE resides on a read-only filesystem?

Absolutely.  The easiest way would be to log to something that will
never have that problem (e.g., LOGFILE=">/dev/null" or something
similar).  :-)

That said, since all the logging is done by a single function, it's
easy enough to add error handling to it.  In fact, I just did. :-)

diff --git a/nhc b/nhc
index 731e3bb..2b5b8c7 100755
--- a/nhc
+++ b/nhc
@@ -53,6 +53,10 @@ function die() {
 function dbg() {
     if [[ "$DEBUG" != "0" ]]; then
         eval echo '"DEBUG:  $*"' $LOGFILE
+        if [[ $? -ne 0 ]]; then
+            syslog "Unable to write to LOGFILE (\"$LOGFILE\") as user \"$USER\" (uid $EUID) -- Read-only filesystem?  Suppressing further output."
+            export LOGFILE=">/dev/null"
+        fi
     fi
 }
 
@@ -60,6 +64,10 @@ function dbg() {
 function log() {
     if [[ "$SILENT" = "0" ]]; then
         eval echo '"$@"' $LOGFILE
+        if [[ $? -ne 0 ]]; then
+            syslog "Unable to write to LOGFILE (\"$LOGFILE\") as user \"$USER\" (uid $EUID) -- Read-only filesystem?  Suppressing further output."
+            export LOGFILE=">/dev/null"
+        fi
     fi
 }
 

That should prevent the helper from failing as well as ensure that
ERROR appears on the final line of output so that, if down_on_error is
set in pbs_mom, it will do the right thing and down the node.

I tested this locally by running it as non-root trying to log to
/var/log/nhc.log, and it did indeed suppress further error output.
Please let me know if that does/doesn't address your particular
scenario.  Thanks for the bug report!  :-)
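
For anyone who wants to see the fallback in isolation, here is a
minimal sketch of the same pattern outside of NHC.  It is not the NHC
source: the syslog function below is a hypothetical stand-in that only
counts calls, the log path is a made-up one that cannot be opened, and
the failure is tested with "if !" rather than $? so that the sketch
also behaves under "set -e".

```shell
# LOGFILE holds a complete redirection (operator plus target), as in the patch.
# The path is hypothetical and chosen so that opening it always fails.
LOGFILE=">>/nonexistent-dir/nhc.log"

# Hypothetical stand-in for NHC's syslog helper; it only counts calls.
SYSLOG_CALLS=0
syslog() {
    SYSLOG_CALLS=$((SYSLOG_CALLS + 1))
}

log() {
    # eval is required so the redirection stored in $LOGFILE is parsed as one.
    # The trailing 2>/dev/null hides the shell's own "cannot open" message.
    if ! eval echo '"$@"' $LOGFILE 2>/dev/null; then
        syslog "Unable to write to LOGFILE ($LOGFILE); suppressing further output."
        LOGFILE=">/dev/null"
    fi
}

log "first message"     # cannot open the log file, so the fallback kicks in
log "second message"    # now redirected to /dev/null, no error
```

After the first call fails, LOGFILE has been switched to ">/dev/null",
so the second call is silent and syslog has been invoked exactly once.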



On Monday, 07 October 2013, at 03:35:16 (-0700),
Roman Baranowski wrote:

> For many years our local health check script has done the following:
> 
> #
> # 2d)  - read/write on local /
> #
>          touch /write.test  >& /dev/null
>          if [ $? != "0" ]
>          then
>            echo "ERROR cannot write to  local root FS "
>            exit 1
>          fi
> 
> A similar test is performed on all other 'local' partitions (/tmp, etc.).
> Works like a charm....

NHC offers checks for identifying filesystems which have been
re-mounted read-only.  Ole's output actually shows that it had
identified one (/scratch).  But that doesn't actually resolve the
problem in question.  :-)
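
As a rough illustration of that kind of check (a sketch only, not
NHC's actual check_fs_mount implementation), a read-only remount can
be spotted by looking at the mount options rather than by attempting a
write.  The function name is_readonly_mount is hypothetical, and the
sketch assumes a Linux-style mounts table such as /proc/mounts.

```shell
# is_readonly_mount MOUNTPOINT [MOUNTS_TABLE]
# Succeeds (exit 0) if the last entry for MOUNTPOINT lists "ro" among its
# options.  MOUNTS_TABLE defaults to /proc/mounts; the name is hypothetical.
is_readonly_mount() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }          # later entries override earlier ones
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") found = 1
            exit (found ? 0 : 1)
        }
    ' "${2:-/proc/mounts}"
}

# Example: print an ERROR line if / has gone read-only.
if is_readonly_mount /; then
    echo "ERROR root filesystem is mounted read-only"
fi
```

Detecting the condition this way is the easy part; the failure Ole
hit was in reporting it once the log file itself could not be written.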

Thanks for the suggestion, though!

Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

