[torqueusers] NHC check for mcelog errors
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Nov 26 06:46:40 MST 2012
Hi Michael,
At SC12 I suggested that you add a check to Node Health Check (NHC)
that queries the mcelog daemon for hardware errors and offlines sick
nodes in Torque. Along these lines, just today we had a compute node
with a memory error that caused /var/log/mcelog to fill up the / file
system :-)
Reading the mcelog man page, I found that there is a simple interface
for querying the mcelog daemon for detected hardware errors (please see
the sample output below).
I wonder if you would consider adding a check to NHC such that if the
output of the command "mcelog --client" is non-empty, the node is
offlined by Torque? Reading your Wiki page
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check, it wasn't
immediately obvious to me how to write the most efficient check myself.
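A minimal sketch of what such a check could look like as a shell
function, assuming nothing about NHC's internal helper functions. The
function name check_mcelog_client and the MCELOG_CMD override variable
are hypothetical and only serve to illustrate the "non-empty output
means unhealthy" rule; how a failure would be reported to NHC, and
hence to Torque, is left out.

```shell
#!/bin/bash
# Hypothetical check: return 0 when the mcelog client prints nothing,
# return 1 when it prints anything at all.
# MCELOG_CMD is an assumed override hook so that the function can be
# exercised on a machine without a running mcelog daemon.
check_mcelog_client() {
    local cmd="${MCELOG_CMD:-mcelog --client}"
    local output

    # Word splitting of $cmd is intentional: it holds the command plus
    # its arguments. Errors from the command itself are discarded, so a
    # missing or failing mcelog yields empty output and is treated as
    # healthy in this sketch.
    output=$($cmd 2>/dev/null)

    if [ -n "$output" ]; then
        echo "check_mcelog_client: mcelog reports hardware errors" >&2
        return 1
    fi
    return 0
}
```

A wrapper would call check_mcelog_client and mark the node offline when
it returns non-zero. Treating any output as a failure is the simplest
rule; a node with only corrected errors would also be flagged.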
Please note that mcelog version 1.0 is required in order to have the
--client flag available. This works fine on CentOS 6.x, whereas CentOS
5.x has an older version 0.7. The mcelog project homepage is
http://mcelog.org/ and there is a paper at
http://halobates.de/lk10-mcelog.pdf.
Best regards,
Ole
Sample output
-------------
[root@g013 ~]# mcelog --client
Memory errors
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
9325 total
0 in 24h
uncorrected memory errors:
0 total
0 in 24h
SOCKET 1 CHANNEL 1 DIMM any
corrected memory errors:
9325 total
9325 in 24h
uncorrected memory errors:
0 total
0 in 24h
Per page corrected memory statistics:
ce4829000: total 1 seen "1 in 24h" online
(many lines deleted)
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark