[torqueusers] NHC check for mcelog errors

Christopher Samuel samuel at unimelb.edu.au
Thu Jan 3 16:35:00 MST 2013


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 27/11/12 00:46, Ole Holm Nielsen wrote:

> At SC12 I suggested to you to add a check to Node Health Check to 
> inquire the mcelog daemon for hardware errors and offline sick
> nodes in Torque.

It's worth noting that our SGI hardware (rebadged SuperMicro boxes,
dual-socket quad-core Nehalem) runs different MCE code (called
memlog) from the standard RHEL/CentOS one, as SGI's code can pull more
info out.  It reports to /var/log/memlog/memlog.log.

Also, our IBM iDataPlex nodes log MCEs via their
IMM/IPMI/BMC/RSA/whatever-its-called-today controller and so we
monitor for those by parsing the output of "ipmitool sel elist".
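
As a rough illustration of that kind of check, here is a minimal Python
sketch that filters "ipmitool sel elist" output for memory-related
events. It is not our production script. The keyword list, the function
names and the sample lines are all assumptions made for the example, and
it assumes the usual pipe-separated layout (id | date | time | sensor |
event | direction), which may differ between controllers and ipmitool
versions.

```python
import subprocess

# Assumed keywords; adjust to whatever your controller actually logs.
KEYWORDS = ("memory", "ecc", "machine check")


def parse_sel(text):
    """Return SEL entries whose sensor or event field mentions a keyword."""
    hits = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        sensor_and_event = " ".join(fields[3:5]).lower()
        if any(k in sensor_and_event for k in KEYWORDS):
            hits.append(fields)
    return hits


def read_sel():
    """Run ipmitool and return its output (usually needs root and a local BMC)."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up sample lines in the assumed layout, for illustration only.
SAMPLE = """\
   1 | 01/03/2013 | 16:35:00 | Memory #0x01 | Correctable ECC | Asserted
   2 | 01/03/2013 | 16:40:12 | Power Supply #0x02 | Presence detected | Asserted
"""

if __name__ == "__main__":
    for entry in parse_sel(SAMPLE):
        print(" | ".join(entry))
```

On a real node you would pass read_sel() to parse_sel() instead of
SAMPLE, and have the health check mark the node offline when the
result is non-empty.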

cheers,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDmFaQACgkQO2KABBYQAh+6iwCZAZl63YkK5xDTwV65qOQi3+xG
u/EAn2qbc7TF0rLn2l0PtwxEXiYMJiAM
=T116
-----END PGP SIGNATURE-----

