[torqueusers] NHC check for mcelog errors (Christopher Samuel)
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jan 8 01:05:11 MST 2013
Christopher Samuel <samuel at unimelb.edu.au> wrote:
> It's worth noting that our SGI hardware (rebadged SuperMicro boxes,
> dual socket quad core Nehalem) runs a different MCE code (called
> memlog) to the standard RHEL/CentOS one as their code can pull more
> info out. That reports to /var/log/memlog/memlog.log.
Sounds interesting. I googled for memlogd and apparently it reports
solely memory errors, which is certainly a very important check.
Perhaps Michael could be persuaded to write yet another Node Health
Check for memlog, but he would have to check for the addition of new
errors to the cumulative logfile.
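
A minimal sketch of such a "new entries only" check on a cumulative logfile, in Python rather than NHC's own shell. It remembers the byte offset reached on the previous run and returns only the complete lines appended since then. The function name read_new_lines and the state-file approach are assumptions for illustration; nothing here is specific to the memlog format, which I have not seen.

```python
import os
import tempfile


def read_new_lines(log_path, state_path):
    """Return complete lines appended to log_path since the previous call.

    state_path is a small hypothetical file holding the byte offset
    that was reached last time.
    """
    # Load the saved offset; start from the beginning if there is none.
    try:
        with open(state_path) as f:
            offset = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        offset = 0

    # A file shorter than the saved offset was rotated or truncated.
    if os.path.getsize(log_path) < offset:
        offset = 0

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only up to the last newline, so that a line still being
    # written is picked up whole on the next run.
    end = data.rfind(b"\n") + 1  # 0 when there is no complete new line
    with open(state_path, "w") as f:
        f.write(str(offset + end))
    return data[:end].decode(errors="replace").splitlines()


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    log = os.path.join(workdir, "example.log")
    state = os.path.join(workdir, "example.offset")
    with open(log, "w") as f:
        f.write("first entry\n")
    print(read_new_lines(log, state))  # lines seen for the first time
    print(read_new_lines(log, state))  # nothing new on the second run
```

A health check would then flag the node only when the returned list contains lines matching whatever error pattern the log actually uses.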
> Also our IBM iDataplex nodes log MCEs via their
> IMM/IPMI/BMC/RSA/whatever-its-called-today controller and so we
> monitor for those by parsing the output of "ipmitool sel elist".
The BMC error log is worth checking whenever a node is really broken,
but I doubt that it makes sense to run it frequently as an NHC check,
for these reasons:
* You need to start the IPMI service ("service ipmi start" on RHEL6),
and in my experience with Intel systems this always incurs an
unwarranted additional CPU load of 1.0 :-(
* The time it takes to start IPMI, run "ipmitool sel elist", then stop
IPMI is substantial (several seconds), so it may not be suitable as an
NHC check, which should be as quick as possible.
* The IPMI error log is cumulative, so one would have to look for
changes. Also, some BMCs do not seem to keep a reliable time/date, so
the timestamps on log entries cannot be trusted.
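
To illustrate the last point, here is a rough Python sketch of looking for changes in the SEL without relying on timestamps. It assumes the "ipmitool sel elist" output is pipe-delimited with the record ID in the first field, and it compares those IDs against a set seen on an earlier run. The function names and the sample lines in the usage note are made up for illustration; actually running ipmitool and storing the seen IDs between runs is left out.

```python
def parse_sel_entries(elist_output):
    """Map record ID -> full line, for pipe-delimited SEL output.

    Assumption: the first '|'-separated field of each entry is its
    record ID. Lines without a '|' (e.g. status messages) are skipped.
    """
    entries = {}
    for line in elist_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        entries[fields[0]] = line.strip()
    return entries


def new_sel_entries(elist_output, seen_ids):
    """Return entries whose record ID is not in seen_ids, in log order."""
    entries = parse_sel_entries(elist_output)
    return [line for rec_id, line in entries.items() if rec_id not in seen_ids]
```

For example, given two made-up entries with IDs 1 and 2 and seen_ids = {"1"}, only the second line is returned. One caveat with this approach: if the SEL is cleared, record IDs may start over, so the saved set would have to be reset at the same time.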
My 2 cents worth...
Best regards,
Ole
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark