[torqueusers] NHC check for mcelog errors
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Jan 2 07:48:49 MST 2013
On 11/30/2012 03:03 AM, Michael Jennings wrote:
> I've added check_hw_mcelog to SVN. Feel free to download and give it
> a try. The SVN repository is at: https://warewulf.lbl.gov/svn/trunk/nhc
> Packages can be easily built from SVN via:
> $ ./autogen.sh && make distcheck && rpmbuild -ta warewulf-nhc-*.tar.gz
>
> The check is pretty simple at the moment. If mcelog --client returns
> anything (other than "Connection refused," meaning no daemon is
> running), the check fails. Otherwise, it passes.
We've been running NHC (SVN version) for a month now, and a few days ago
one node had a Machine Check Exception (MCE) event which was reported in
the NHC logfile as expected:
Running check: "check_hw_mcelog"
check_hw_mcelog(): MCEs detected: Memory errors
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
        1 total
        0 in 24h
uncorrected memory errors:
        0 total
        0 in 24h
SOCKET 1 CHANNEL 2 DIMM any
corrected memory errors:
        1 total
        1 in 24h
uncorrected memory errors:
        0 total
        0 in 24h
Health check failed: MCEs detected in log.
20121231 12:03:01 /usr/libexec/nhc/node-mark-offline g032 MCEs detected in log.
/usr/libexec/nhc/node-mark-offline: Marking job-exclusive g032 offline: NHC: MCEs detected in log.
This was probably a transient memory hardware error, but it seems that
the NHC check_hw_mcelog() is doing its job correctly.
Best regards,
Ole
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark