[torqueusers] NHC check for mcelog errors (Christopher Samuel)

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Jan 8 01:05:11 MST 2013

Christopher Samuel <samuel at unimelb.edu.au> wrote:
> It's worth noting that our SGI hardware (rebadged SuperMicro boxes,
> dual socket quad core Nehalem) runs a different MCE code (called
> memlog) to the standard RHEL/CentOS one as their code can pull more
> info out.  That reports to /var/log/memlog/memlog.log.

Sounds interesting.  I googled for memlogd and apparently it reports 
solely memory errors, which is certainly a very important check. 
Perhaps Michael could be persuaded to write yet another Node Health 
Check for memlog, but he would have to check for the addition of new 
errors to the cumulative logfile.
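
One way to detect additions to a cumulative logfile is to remember how
far the file was read last time and only look at what was appended
since.  The following is a minimal sketch in Python, not an actual NHC
check; the function name read_new_lines and the state file holding the
byte offset are assumptions made for illustration, and nothing here
relies on the format of the memlog entries.  Note that the first run
reports the entire existing file.

```python
import os


def read_new_lines(log_path, state_path):
    """Return complete lines appended to log_path since the previous call.

    state_path is a hypothetical small file that stores the byte offset
    up to which the log has already been examined.
    """
    try:
        with open(state_path) as f:
            offset = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0  # no usable state yet: start at the top of the log

    if os.path.getsize(log_path) < offset:
        # The log shrank, so it was rotated or truncated: start over.
        offset = 0

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only complete lines; a partially written last line is
    # left for the next run.
    end = data.rfind(b"\n") + 1
    data = data[:end]

    with open(state_path, "w") as f:
        f.write(str(offset + end))

    return data.decode("utf-8", errors="replace").splitlines()
```

A check built on this would fail the node if the returned list is
non-empty, or if any returned line matches whatever pattern one cares
about.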

> Also our IBM iDataplex nodes log MCEs via their
> IMM/IPMI/BMC/RSA/whatever-its-called-today controller and so we
> monitor for those by parsing the output of "ipmitool sel elist".

The BMC error log is worth checking whenever a node is really broken, 
but I doubt it makes sense to run it frequently as an NHC check, for 
these reasons:

* You need to start the IPMI service ("service ipmi start" on RHEL6), 
and in my experience with Intel systems this always incurs an 
unwarranted additional CPU load of 1.0 :-(
* The time it takes to start IPMI, run "ipmitool sel elist", then stop 
IPMI is substantial (several seconds), hence it may not be suitable as 
an NHC check, which should be as quick as possible.
* The IPMI error log is cumulative, so one would have to look for 
changes.  Also, some BMCs do not seem to have reliable time/date, making 
timestamps unreliable.
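
Looking for changes in a cumulative event log can be done by comparing
the current output with a saved copy of the previous output.  Below is
a minimal sketch in Python under a few assumptions: the function names
read_sel and new_sel_entries are made up for illustration, each output
line is treated as an opaque string, and the caller is responsible for
storing the previous snapshot somewhere.  Because whole lines are
compared, a wrong BMC clock does not matter as long as the timestamp
text of an existing entry stays the same between runs.  The sketch does
nothing about the startup time or the CPU load mentioned above.

```python
import subprocess


def read_sel():
    """Run "ipmitool sel elist" and return its non-empty output lines."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def new_sel_entries(previous, current):
    """Return the lines of current that were not in previous, in order."""
    seen = set(previous)
    return [line for line in current if line not in seen]
```

If the event log has been cleared since the last snapshot, the current
list is simply shorter and only entries logged after the clear are
reported.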

My 2 cents worth...

Best regards,

Ole Holm Nielsen
Department of Physics, Technical University of Denmark
