[torqueusers] NHC check for mcelog errors (Christopher Samuel)
samuel at unimelb.edu.au
Tue Jan 8 23:55:36 MST 2013
On 08/01/13 19:05, Ole Holm Nielsen wrote:
> The BMC error log is worth checking whenever a node is really
> broken, but I doubt whether it makes sense for running frequently
> as an NHC check for these reasons:
> * You need to start the IPMI service ("service ipmi start" on
> RHEL6), and in my experience with Intel systems this always incurs
> an unwarranted additional CPU load of 1.0 :-(
We're running RHEL 5, so it doesn't do that. It's worth checking,
though, whether it's actually consuming CPU or just sitting in a
device wait. If it's just in a device wait then it counts towards the
load average but won't take any CPU time away from jobs.
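
One way to tell the two cases apart on Linux is to sample the process's
entry in /proc twice and compare. This is only a rough sketch: the
function names are made up for illustration, and the PID of whatever
thread the IPMI driver starts on your system has to be looked up
separately. State "D" is uninterruptible sleep, which is counted in the
load average even though no CPU time is used.

```python
import time


def parse_proc_stat(line):
    """Extract the state and CPU tick counters from one /proc/<pid>/stat line."""
    # The command name is in parentheses and may contain spaces,
    # so split on the last ')' rather than on whitespace alone.
    _, _, tail = line.rpartition(")")
    fields = tail.split()
    # After the command name: field 0 is the state, 11 is utime, 12 is stime.
    return {
        "state": fields[0],
        "utime": int(fields[11]),
        "stime": int(fields[12]),
    }


def classify(before, after):
    """Compare two samples and say whether CPU time was used in between."""
    used = (after["utime"] + after["stime"]) - (before["utime"] + before["stime"])
    if used > 0:
        return "consuming CPU"
    if after["state"] == "D":
        return "device wait"
    return "idle"


def sample(pid):
    """Read and parse /proc/<pid>/stat for a running process (Linux only)."""
    with open("/proc/%d/stat" % pid) as handle:
        return parse_proc_stat(handle.read())


def check(pid, interval=5.0):
    """Take two samples `interval` seconds apart and classify the process."""
    first = sample(pid)
    time.sleep(interval)
    return classify(first, sample(pid))
```

Calling check(pid) with the PID in question returns "consuming CPU" only
if the tick counters moved during the interval, so a result of "device
wait" suggests the load figure is cosmetic.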
> * The time it takes to start IPMI, run "ipmitool sel elist", then
> stop IPMI is substantial (several seconds), hence it may not be
> suitable as an NHC check which should be as quick as possible.
I think that depends on how often you run them. We run ours every 15
minutes, and they write their results to a file in /dev/shm. The check
that pbs_mom runs just cats that file (or produces an error if it's
not present), so the slow IPMI query never happens inside the check.
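
The split between a slow periodic job and a fast check could be
sketched roughly as below. This is not our actual script: the function
names are hypothetical, the convention that an empty file means
"healthy" is an assumption, and the staleness test is an extra that
the arrangement described above does not include.

```python
import os
import tempfile
import time


def write_status(path, text):
    """Called by the periodic job: atomically replace the status file."""
    directory = os.path.dirname(path) or "."
    # Write to a temporary file in the same directory, then rename it,
    # so the fast check never reads a half-written file.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)


def read_status(path, max_age_seconds=1800):
    """Called by the fast check: return (exit_code, message).

    A non-zero exit code means the node should be flagged.
    """
    try:
        age = time.time() - os.path.getmtime(path)
        with open(path) as handle:
            text = handle.read().strip()
    except OSError:
        # Missing or unreadable file is itself reported as an error.
        return 1, "ERROR: status file %s is missing" % path
    if age > max_age_seconds:
        # Assumption: a file older than two 15-minute runs means the
        # periodic job has stopped working.
        return 1, "ERROR: status file %s is stale" % path
    if text:
        # Assumption: any content in the file describes a problem.
        return 1, text
    return 0, ""
```

The periodic job would pass whatever problems it found in the
"ipmitool sel elist" output to write_status(), and the check only ever
calls read_status(), which touches nothing but a file in memory.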
> * The IPMI error log is cumulative, so one would have to look for
We clean our IPMI logs after the node has been fixed.
> Also, some BMCs do not seem to have reliable time/date, making
> timestamps unreliable.
Because our scripts log to syslog when they find a problem, we
should know to within a 15-minute window when the problem occurred.
> My 2 cents worth...
I think it's worth a lot more than that; they're all good points that
need to be considered!
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545