[torqueusers] NHC check for mcelog errors

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Mon Nov 26 06:46:40 MST 2012


Hi Michael,

At SC12 I suggested that you add a check to Node Health Check (NHC) that 
queries the mcelog daemon for hardware errors and offlines sick nodes in 
Torque.  Along these lines, just today we had a compute node with a 
memory error that caused /var/log/mcelog to fill up the / file system :-)

Reading the mcelog man page, I found that there is a simple interface 
for querying the mcelog daemon for detected hardware errors (please see 
the sample output at the end of this message).

Would you consider adding a check to NHC so that a node is offlined in 
Torque whenever the output of the command "mcelog --client" is 
non-empty?  Reading your wiki page 
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check it wasn't 
immediately obvious to me how to write the most efficient check myself.
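
As a rough sketch of the idea only, a shell helper along the following 
lines could decide whether "mcelog --client" reported anything.  The 
names node_has_mce_errors and MCELOG_CMD are hypothetical and not part 
of NHC or mcelog, and treating a failed query as "daemon not reachable" 
is an assumption:

```shell
# Sketch only: node_has_mce_errors and MCELOG_CMD are hypothetical names,
# not part of NHC or mcelog.

# Command used to query the running mcelog daemon (needs the --client flag).
MCELOG_CMD=${MCELOG_CMD:-"mcelog --client"}

# Exit status:
#   0  the daemon reported something (non-empty output): node looks sick
#   1  output was empty or whitespace only: nothing reported
#   2  the query command itself failed (assumed: daemon not reachable)
node_has_mce_errors() {
    # $MCELOG_CMD is deliberately unquoted so "mcelog --client" splits
    # into a command and its argument.
    output=$($MCELOG_CMD 2>/dev/null) || return 2
    # Strip all whitespace so blank lines alone do not count as errors.
    [ -n "$(printf '%s' "$output" | tr -d '[:space:]')" ]
}
```

A wrapper could then run something like 
"node_has_mce_errors && pbsnodes -o $(hostname)" to offline the node, 
although inside NHC it is presumably the failing check itself that 
should trigger the offlining.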

Please note that the --client flag requires mcelog version 1.0.  This 
works fine on CentOS 6.x, whereas CentOS 5.x ships the older version 
0.7.  The mcelog project homepage is http://mcelog.org/ and there is a 
paper at http://halobates.de/lk10-mcelog.pdf.

Best regards,
Ole

Sample output
-------------

[root@g013 ~]# mcelog --client
Memory errors
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
	9325 total
	0 in 24h
uncorrected memory errors:
	0 total
	0 in 24h

SOCKET 1 CHANNEL 1 DIMM any
corrected memory errors:
	9325 total
	9325 in 24h
uncorrected memory errors:
	0 total
	0 in 24h
Per page corrected memory statistics:
ce4829000: total 1 seen "1 in 24h" online
(many lines deleted)

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

