[torqueusers] NHC check for mcelog errors

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Nov 28 04:08:03 MST 2012

On 11/27/2012 02:19 AM, Michael Jennings wrote:
>> Reading the mcelog man-page I found that there exists a simple
>> interface for querying the mcelogd daemon about detected hardware
>> errors (please see the sample below).
>> I wonder if you would consider adding a check to NHC saying that if
>> the output of the command "mcelog --client" is non-empty, then the
>> node should be offlined by Torque?  Reading your Wiki page
>> http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check it wasn't
>> immediately obvious to me how to write the most efficient check
>> myself.
>> Please note that mcelog version 1.0 is required in order to have the
>> --client flag available.  This works fine on CentOS 6.x, whereas
>> CentOS 5.x has an older version 0.7.  The mcelog project homepage is
>> http://mcelog.org/ and there is a paper at
>> http://halobates.de/lk10-mcelog.pdf.
> It's certainly doable.  I can think of two caveats, though.
> 1.  The /dev/mcelog ring buffer doesn't work quite like other kernel
>      ring buffers.  Once the data is read via mcelog, it's gone.  RHEL6
>      has an hourly cron job, /etc/cron.hourly/mcelog.cron, which would
>      interfere with NHC's execution of mcelog --client, and vice versa.

That sounds bad. We don't want to interfere with the usual MCE logging.
It seems silly for the mcelog daemon to wipe its internal log data
every time it is queried. Are you sure the daemon behaves that way?

I'm also confused about the mcelogd daemon logging to /var/log/mcelog
continuously, since that would seem to conflict with the hourly cron job.

> 2.  If the MCE is transient (i.e., it doesn't happen constantly), NHC
>      may wind up offlining the node due to the MCE check during one run
>      and onlining it again during the next run (because no further MCEs
>      have occurred in the interim).  Is that acceptable?

This is no good either.  It seems we need a different approach, where
persistent MCE errors cause the node to stay offlined until the next
reboot.  Anyway, an NHC check script looking at MCE events would be
optional for sites to experiment with.
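
To make the "offlined until next reboot" idea concrete, here is a rough
bash sketch of a sticky check. It is only an illustration, not an
existing NHC check: the names check_mce_sticky, MCE_CMD and MCE_MARKER
are made up for this example, and I am assuming that /dev/shm is a
tmpfs that is emptied at reboot. Any output from "mcelog --client" is
appended to a marker file, and the check keeps failing for as long as
that file is non-empty, so a transient MCE would not let the node come
back online on the next run. It does not solve the conflict with the
hourly cron job.

```bash
#!/bin/bash
# Sketch of a "sticky" MCE check. MCE_CMD and MCE_MARKER are hypothetical
# names chosen for this example; they are not part of NHC or mcelog.
MCE_CMD="${MCE_CMD:-mcelog --client}"
# Assumption: /dev/shm is tmpfs, so the marker disappears at reboot.
MCE_MARKER="${MCE_MARKER:-/dev/shm/nhc_mce_seen}"

check_mce_sticky() {
    local output
    # Query the daemon; a missing or failing command counts as "no data".
    output=$($MCE_CMD 2>/dev/null) || output=""
    if [ -n "$output" ]; then
        # Append, so events seen in earlier runs are kept.
        printf '%s\n' "$output" >> "$MCE_MARKER"
    fi
    # Fail as long as any MCE has been recorded since the marker was created.
    if [ -s "$MCE_MARKER" ]; then
        echo "MCE errors recorded since boot (see $MCE_MARKER)"
        return 1
    fi
    return 0
}
```

A non-zero return from check_mce_sticky would be the signal to offline
the node; how that is wired into NHC's own failure handling is left
open here. Removing the marker file by hand would clear the condition
without a reboot.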

