[torqueusers] NHC check for mcelog errors

Michael Jennings mej at lbl.gov
Mon Nov 26 18:19:31 MST 2012


On Monday, 26 November 2012, at 14:46:40 (+0100),
Ole Holm Nielsen wrote:

> At SC12 I suggested that you add a check to Node Health Check that
> queries the mcelog daemon for hardware errors and offlines sick
> nodes in Torque.  Along these lines, just today we had a compute
> node with a memory error that caused /var/log/mcelog to fill up the
> / file system :-)

We *do* have a check for / being full....  ;-)

> Reading the mcelog man page, I found that there is a simple
> interface for querying the mcelogd daemon about detected hardware
> errors (please see the sample below).
> 
> I wonder if you would consider adding a check to NHC saying that if
> the output of the command "mcelog --client" is non-empty, then the
> node should be offlined by Torque?  Reading your Wiki page
> http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check it wasn't
> immediately obvious to me how to write the most efficient check
> myself.
> 
> Please note that mcelog version 1.0 is required in order to have the
> --client flag available.  This works fine on CentOS 6.x, whereas
> CentOS 5.x has an older version 0.7.  The mcelog project homepage is
> http://mcelog.org/ and there is a paper at
> http://halobates.de/lk10-mcelog.pdf.

It's certainly doable.  I can think of two caveats, though.

1.  The /dev/mcelog ring buffer doesn't work quite like other kernel
    ring buffers.  Once the data is read via mcelog, it's gone.  RHEL6
    has an hourly cron job, /etc/cron.hourly/mcelog.cron, which would
    interfere with NHC's execution of mcelog --client, and vice versa.

2.  If the MCE is transient (i.e., it doesn't happen constantly), NHC
    may wind up offlining the node due to the MCE check during one run
    and onlining it again during the next run (because no further MCEs
    have occurred in the interim).  Is that acceptable?
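
To make caveat 2 concrete, here is a rough sketch of one way to keep a
node flagged after a transient MCE.  It is written in Python purely
for illustration and is not how NHC implements its checks.  The
function names and the marker file are hypothetical.  The only thing
taken from this thread is the rule that non-empty output from
"mcelog --client" means an error was detected.

```python
import subprocess
from pathlib import Path


def query_mcelog(cmd=("mcelog", "--client")):
    """Ask the running mcelog daemon for recorded errors.

    Returns the command's standard output with surrounding whitespace
    removed.  Assumes mcelog 1.0 or later is installed and the daemon
    is running; subprocess.run raises FileNotFoundError otherwise.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return result.stdout.strip()


def check_mce(marker, query=query_mcelog):
    """Return (healthy, message) for the node.

    marker is a hypothetical state file.  Once any MCE has been
    reported, the file exists and the node stays unhealthy on every
    later run until an administrator deletes the file.
    """
    marker = Path(marker)
    report = query()
    if report:
        # Latch the failure: append the report so that a later run
        # with empty output does not bring the node back online.
        with marker.open("a") as fh:
            fh.write(report + "\n")
    if marker.exists():
        return False, f"MCE recorded in {marker}; remove the file to clear"
    return True, "no MCE reported"
```

A caller would offline the node whenever check_mce() returns False as
its first value.  The price of latching is that someone has to remove
the marker file by hand after inspecting the hardware.  This sketch
does nothing about caveat 1: the hourly cron job can still consume
events before the check sees them.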

What are your thoughts?

Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

