[torqueusers] NHC check for mcelog errors

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Jan 2 07:48:49 MST 2013


On 11/30/2012 03:03 AM, Michael Jennings wrote:
> I've added check_hw_mcelog to SVN.  Feel free to download and give it
> a try.  The SVN repository is at:  https://warewulf.lbl.gov/svn/trunk/nhc
> Packages can be easily built from SVN via:
>    $ ./autogen.sh && make distcheck && rpmbuild -ta warewulf-nhc-*.tar.gz
>
> The check is pretty simple at the moment.  If mcelog --client returns
> anything (other than "Connection refused," meaning no daemon is
> running), the check fails.  Otherwise, it passes.

We've been running NHC (SVN version) for a month now, and a few days ago
one node had a Machine Check Exception (MCE) event, which was reported in
the NHC logfile as expected:

Running check:  "check_hw_mcelog"
check_hw_mcelog():  MCEs detected:  Memory errors
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
         1 total
         0 in 24h
uncorrected memory errors:
         0 total
         0 in 24h

SOCKET 1 CHANNEL 2 DIMM any
corrected memory errors:
         1 total
         1 in 24h
uncorrected memory errors:
         0 total
         0 in 24h
Health check failed:  MCEs detected in log.
20121231 12:03:01 /usr/libexec/nhc/node-mark-offline g032 MCEs detected in log.
/usr/libexec/nhc/node-mark-offline:  Marking job-exclusive g032 offline:  NHC: MCEs detected in log.

This was probably a transient memory hardware error (the log shows a
single corrected error on socket 1, channel 2, and no uncorrected
errors), but it seems that the NHC check_hw_mcelog() is doing its job
correctly.
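
For anyone reading this in the archive, the pass/fail rule Michael
describes in the quoted text can be sketched roughly as below. This is
only an illustration in Python, not the actual NHC code (NHC checks are
shell functions); the names mcelog_check and run_mcelog_client are
hypothetical, and matching the substring "connection refused" is my
assumption about how the "no daemon running" case would be recognized.

```python
import subprocess
from typing import Tuple


def run_mcelog_client() -> str:
    """Run `mcelog --client` and return everything it printed.

    Hypothetical helper: assumes the mcelog binary is on PATH.
    """
    result = subprocess.run(
        ["mcelog", "--client"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def mcelog_check(client_output: str) -> Tuple[bool, str]:
    """Return (passed, message) for the text printed by `mcelog --client`."""
    text = client_output.strip()
    # No output at all: the daemon has nothing to report, so the check passes.
    if not text:
        return True, ""
    # "Connection refused" means no mcelog daemon is running; per the
    # description quoted above this is not treated as a failure.
    # (Assumption: a case-insensitive substring match is good enough.)
    if "connection refused" in text.lower():
        return True, ""
    # Anything else is treated as a reported MCE and fails the check.
    return False, "MCEs detected:  " + text
```

On a node with mcelog installed, mcelog_check(run_mcelog_client()) would
return the pass/fail flag along with the text to log. Output like the
"Memory errors" report above is neither empty nor a refused connection,
so it fails the check and the node is marked offline.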

Best regards,
Ole

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark

