[torqueusers] NHC check for mcelog errors (Christopher Samuel)

Christopher Samuel samuel at unimelb.edu.au
Tue Jan 8 23:55:36 MST 2013

Hi Ole,

On 08/01/13 19:05, Ole Holm Nielsen wrote:

> The BMC error log is worth checking whenever a node is really
> broken, but I doubt whether it makes sense for running frequently
> as an NHC check for these reasons:
> * You need to start the IPMI service ("service ipmi start" on
> RHEL6), and in my experience with Intel systems this always incurs
> an unwarranted additional CPU load of 1.0 :-(

We're running RHEL 5 so it doesn't do that.  It's worth checking
though if it's actually consuming CPU or just sitting in a device
wait.  If it's just in a device wait then it won't have any impact.
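
One way to tell the two cases apart is to look at the process state and
the CPU ticks it has accumulated in /proc/<pid>/stat. On Linux the load
average counts tasks in uninterruptible sleep (state D) as well as
runnable ones, so a load of 1.0 does not by itself mean CPU is being
burned. The snippet below is only a sketch: the helper names are made up
for illustration, it assumes a Python 3 interpreter on the node, and the
field positions follow proc(5).

```python
import time


def parse_proc_stat(stat_line):
    """Return (comm, state, utime_ticks, stime_ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split around the LAST ')'.
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, which land at indexes 11 and 12 here.
    return comm, fields[0], int(fields[11]), int(fields[12])


def sample(pid, interval=5.0):
    """Hypothetical helper: sample a PID twice, return (state, CPU ticks used in between)."""
    def read():
        with open("/proc/%d/stat" % pid) as f:
            return parse_proc_stat(f.read())

    _, _, u1, s1 = read()
    time.sleep(interval)
    _, state, u2, s2 = read()
    return state, (u2 + s2) - (u1 + s1)
```

If the second value stays at or near zero while the state is D, the
process is waiting on the device rather than using CPU.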

> * The time it takes to start IPMI, run "ipmitool sel elist", then
> stop IPMI is substantial (several seconds), hence it may not be
> suitable as an NHC check which should be as quick as possible.

I think that depends on how often you run them. We run ours every 15
minutes and they write to a file in /dev/shm. The check that pbs_mom
runs just cats that file (or produces an error if it's not present).
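
A rough sketch of that split between a slow periodic job and a fast
check follows. It is not our actual scripts: the function names, the
file path used in the usage note, and the wording of the error line are
all assumptions made for illustration. The slow side writes its result
atomically so the fast side never reads a half-written file.

```python
import os
import tempfile


def write_result(path, text):
    """Slow side (e.g. run from cron): atomically replace the cached result."""
    directory = os.path.dirname(path) or "."
    # Write to a temp file in the same directory, then rename over the
    # target, so a concurrent reader sees either the old or the new file.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def read_result(path):
    """Fast side: return the cached text, or an error line if the file is absent."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # Assumed wording; use whatever your health check consumer expects.
        return "ERROR cached health check result missing: %s\n" % path
```

The periodic job would call something like
write_result("/dev/shm/nhc-result", output) and the check that pbs_mom
runs would just print read_result("/dev/shm/nhc-result"), so its cost is
one small read from memory-backed storage.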

> * The IPMI error log is cumulative, so one would have to look for 
> changes.

We clean our IPMI logs after the node has been fixed.
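
Where clearing the log isn't an option, looking for changes could be as
simple as keeping the previous listing and reporting only lines that
were not in it. This is a minimal sketch, not something we run: the
function name is made up, and it compares whole lines without assuming
anything about the format of the entries.

```python
def new_entries(previous, current):
    """Return the lines in `current` that did not appear in `previous`, in order."""
    seen = set(previous)
    return [line for line in current if line not in seen]
```

The periodic job would save the current listing after each run and pass
it in as `previous` next time; an empty result means nothing new has
been logged since the last run.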

> Also, some BMCs do not seem to have reliable time/date, making 
> timestamps unreliable.

Because our scripts log to syslog when they find a problem, we
should know to within a 15 minute window when the problem occurred.

> My 2 cents worth...

I think it's worth a lot more than that; they're all good points that
need to be considered!

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci


