[torqueusers] NHC check for mcelog errors (Christopher Samuel)

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Jan 9 01:00:20 MST 2013

Christopher Samuel wrote:
>> The BMC error log is worth checking whenever a node is really
>> broken, but I doubt whether it makes sense to run it frequently
>> as an NHC check, for these reasons:
>> * You need to start the IPMI service ("service ipmi start" on
>> RHEL6), and in my experience with Intel systems this always incurs
>> an unwarranted additional CPU load of 1.0 :-(

> We're running RHEL 5 so it doesn't do that.  It's worth checking
> though if it's actually consuming CPU or just sitting in a device
> wait.  If it's just in a device wait then it won't have any impact.

We see the extra CPU load on both CentOS5 and CentOS6 nodes. How does 
one determine if a CPU load is "harmless"?  We're also seeing extra CPU 
loads of about 0.5 on our new nodes with QDR Infiniband adapters, 
perhaps that's also due to device waits?

We check the compute nodes' CPU load in order to identify misbehaving 
jobs, so even "harmless" CPU loads disturb our monitoring.
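One rough way to answer the "harmless load" question: processes in 
uninterruptible sleep (state "D", typically a device wait, e.g. the 
kipmi0 kernel thread) raise the load average without consuming any CPU 
time. A sketch of how to pick them out, compared against the real CPU 
consumers (the exact process names will of course vary per node):

```shell
#!/bin/sh
# Processes in uninterruptible sleep ("D" state) inflate the load
# average but use no CPU cycles -- list them.
ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ {print $2, $3}'

# For comparison, the actual top CPU consumers by %CPU.
ps -eo pcpu=,pid=,comm= | sort -rn | head -5
```

If the extra load shows up only in the first list, it is sitting in a 
device wait and should not steal cycles from jobs, though it still 
pollutes load-based monitoring.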

>> * The time it takes to start IPMI, run "ipmitool sel elist", then
>> stop IPMI is substantial (several seconds), hence it may not be
>> suitable as an NHC check, which should be as quick as possible.
> I think that depends on how often you run them, we run ours every 15
> minutes and they write to a file in /dev/shm. The check that pbs_mom
> runs just cats that file (or produces an error if it's not present).

Interesting!  Perhaps you could share your scripts and setup with the 
Torque community?  Do you see any job performance impacts due to running 
IPMI commands on busy nodes?
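A rough sketch of the setup Chris describes, with the slow IPMI query 
done out-of-band by cron and the NHC check only reading a cached file. 
All paths, file names, and the staleness handling are my assumptions, 
not his actual scripts:

```shell
#!/bin/sh
# Cron side (e.g. every 15 minutes): dump the SEL once into tmpfs so
# the health check itself never has to touch IPMI.  The cache path is
# an assumption.
CACHE=/dev/shm/ipmi_sel.cache
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool sel elist > "$CACHE.tmp" 2>&1 && mv "$CACHE.tmp" "$CACHE"
fi

# NHC/pbs_mom side: just read the cached file, and report an error if
# the cron job has not produced one.
check_sel_cache() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "ERROR: IPMI SEL cache $1 missing" >&2
        return 1
    fi
}

check_sel_cache "$CACHE" || true
```

Since the check side is a plain `cat`, it adds essentially nothing to 
the per-job health-check time, which addresses the "several seconds" 
objection above.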

>> * The IPMI error log is cumulative, so one would have to look for
>> changes.
> We clean our IPMI logs after the node has been fixed.

Yes, but our (older) BMCs log a lot of irrelevant entries which do not 
warrant offlining a node.  How would one distinguish genuinely broken 
hardware from noise in the BMC error logs?
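One possible approach is to grep the SEL output for patterns that 
usually indicate real hardware faults. The pattern list below is 
purely illustrative; actual SEL wording varies by vendor and BMC 
firmware, so it would need tuning per hardware generation:

```shell
#!/bin/sh
# Filter SEL entries down to those that typically mean broken
# hardware.  The pattern list is an illustrative assumption only.
sel_errors() {
    grep -Ei 'uncorrectable|machine check|critical|ECC.*error|failed'
}

# Example use (commented out; needs a live BMC):
# ipmitool sel elist | sel_errors
```

An NHC check could then offline the node only when `sel_errors` 
produces output, instead of on any SEL entry at all.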

>> Also, some BMCs do not seem to have reliable time/date, making
>> timestamps unreliable.
> Because our scripts log to syslog when they find a problem, we
> should know to within a 15-minute window when the problem occurred.

Yes, this sounds like a good method.
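The mechanics might look like the sketch below: logging the finding 
via syslog means the host's (reliable) clock timestamps it, so even 
with a wrong BMC clock the problem is pinned to one check interval. 
The tag and priority are assumptions on my part:

```shell
#!/bin/sh
# Log a detected SEL problem via syslog so the host clock, not the
# (possibly wrong) BMC clock, provides the timestamp.  Tag name and
# priority are assumptions.
report_problem() {
    logger -t nhc-ipmi -p daemon.err "$1"
}

# Example (commented out to avoid spamming syslog):
# report_problem "uncorrectable ECC error found in SEL"
```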


Ole Holm Nielsen
Department of Physics, Technical University of Denmark

More information about the torqueusers mailing list