[torqueusers] NHC check for mcelog errors
Christopher Samuel
samuel at unimelb.edu.au
Thu Jan 3 16:35:00 MST 2013
On 27/11/12 00:46, Ole Holm Nielsen wrote:
> At SC12 I suggested to you to add a check to Node Health Check to
> inquire the mcelog daemon for hardware errors and offline sick
> nodes in Torque.
It's worth noting that our SGI hardware (rebadged SuperMicro boxes,
dual-socket, quad-core Nehalem) runs different MCE logging code (called
memlog) from the standard RHEL/CentOS one, as their code can pull more
info out. It reports to /var/log/memlog/memlog.log.
Also our IBM iDataplex nodes log MCEs via their
IMM/IPMI/BMC/RSA/whatever-its-called-today controller and so we
monitor for those by parsing the output of "ipmitool sel elist".
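
As a rough illustration of that kind of SEL parsing, here is a minimal
Python sketch. It is not our actual check. It assumes that each line of
"ipmitool sel elist" output is a "|"-separated record (id, date, time,
sensor, event description), and the function names and the keyword list
are hypothetical and would need tuning to whatever a given BMC logs.

```python
import subprocess

# Keywords treated as signs of a memory or machine-check problem.
# This list is an assumption; adjust it to what your BMC actually reports.
SUSPECT_KEYWORDS = ("ecc", "machine check", "memory")


def parse_sel_elist(text):
    """Split 'ipmitool sel elist' output into one dict per record."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        records.append({
            "id": fields[0],
            "date": fields[1],
            "time": fields[2],
            "sensor": fields[3],
            # Anything after the sensor column is kept as the event text.
            "event": " | ".join(fields[4:]),
        })
    return records


def suspect_events(records, keywords=SUSPECT_KEYWORDS):
    """Return the records whose sensor or event text matches a keyword."""
    matches = []
    for rec in records:
        haystack = (rec["sensor"] + " " + rec["event"]).lower()
        if any(word in haystack for word in keywords):
            matches.append(rec)
    return matches


def read_sel():
    """Run ipmitool and return its output (needs root and a reachable BMC)."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A health check built on this could call suspect_events(parse_sel_elist(read_sel()))
and flag the node when the returned list is not empty.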
cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci