[torqueusers] NHC check for mcelog errors
mej at lbl.gov
Mon Nov 26 18:19:31 MST 2012
On Monday, 26 November 2012, at 14:46:40 (+0100),
Ole Holm Nielsen wrote:
> At SC12 I suggested that you add a check to Node Health Check to
> query the mcelog daemon for hardware errors and offline sick nodes
> in Torque. Along these lines, just today we had a compute node with
> a memory error that caused /var/log/mcelog to fill up the / file system.
We *do* have a check for / being full.... ;-)
> Reading the mcelog man page, I found that there is a simple
> interface for querying the mcelogd daemon for detected hardware
> errors (please see the sample below).
> I wonder if you would consider adding a check to NHC saying that if
> the output of the command "mcelog --client" is non-empty, then the
> node should be offlined by Torque? Reading your Wiki page
> http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check it wasn't
> immediately obvious to me how to write the most efficient check.
> Please note that mcelog version 1.0 is required in order to have the
> --client flag available. This works fine on CentOS 6.x, whereas
> CentOS 5.x has an older version 0.7. The mcelog project homepage is
> http://mcelog.org/ and there is a paper at
It's certainly doable. I can think of two caveats, though.
1. The /dev/mcelog ring buffer doesn't work quite like other kernel
ring buffers. Once the data is read via mcelog, it's gone. RHEL6
has an hourly cron job, /etc/cron.hourly/mcelog.cron, which would
interfere with NHC's execution of mcelog --client, and vice versa.
2. If the MCE is transient (i.e., it doesn't happen constantly), NHC
may wind up offlining the node due to the MCE check during one run
and onlining it again during the next run (because no further MCEs
have occurred in the interim). Is that acceptable?
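
For illustration, the proposed check and one possible answer to the
second caveat can be sketched roughly as follows. This is only a sketch
in Python, not NHC code (NHC itself is written in bash). The function
names, the marker file, and the "stay offline until an admin removes the
marker" policy are all assumptions made for the example; only the
"mcelog --client" command comes from the quoted message.

```python
import subprocess
from pathlib import Path


def run_mcelog_client() -> str:
    """Return whatever "mcelog --client" prints (needs a running mcelog daemon)."""
    result = subprocess.run(
        ["mcelog", "--client"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def mce_check(run_client, marker: Path) -> bool:
    """Return True if the node looks healthy, False if it should be offlined.

    run_client: callable returning the client output as a string
                (hypothetical hook, so the logic can be tried without mcelog).
    marker:     hypothetical state file that makes a failure "sticky".
    """
    # Caveat 2: a transient MCE would otherwise pass on the next run.
    # Once an error has been recorded, keep failing until the marker is removed.
    if marker.exists():
        return False

    output = run_client()
    if output.strip():
        # Non-empty output means errors were reported; remember them.
        marker.write_text(output)
        return False

    return True


# Possible use on a node (the marker path is hypothetical):
#   healthy = mce_check(run_mcelog_client, Path("/var/run/mce_seen"))
```

With a marker like this, a node that reported an MCE once stays offline
across later runs until someone deletes the file. The sketch does nothing
about the first caveat: another reader, such as the hourly cron job, could
still consume the data before the check sees it.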
What are your thoughts?
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615