[torqueusers] NHC check for mcelog errors (Christopher Samuel)

Jonathan Barber jonathan.barber at gmail.com
Wed Jan 9 03:24:44 MST 2013


On 9 January 2013 08:00, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> Christopher Samuel wrote:
>>> The BMC error log is worth checking whenever a node is really
>>> broken, but I doubt whether it makes sense for running frequently
>>> as an NHC check for these reasons:
>>>
>>> * You need to start the IPMI service ("service ipmi start" on
>>> RHEL6), and in my experience with Intel systems this always incurs
>>> an unwarranted additional CPU load of 1.0 :-(
>
>> We're running RHEL 5 so it doesn't do that.  It's worth checking
>> though if it's actually consuming CPU or just sitting in a device
>> wait.  If it's just in a device wait then it won't have any impact.
>
> We see the extra CPU load on both CentOS5 and CentOS6 nodes. How does
> one determine if a CPU load is "harmless"?  We're also seeing extra CPU
> loads of about 0.5 on our new nodes with QDR Infiniband adapters,
> perhaps that's also due to device waits?

Pedantic point: it's "load" and not "CPU load", because it counts the
processes that are runnable or blocked in uninterruptible I/O wait; it
does not measure CPU usage. From RHEL6 proc(5), load is:
... number of jobs in the run queue (state R) or waiting for disk I/O (state D)

To see exactly where a process is waiting, check its WCHAN (the kernel
function it is sleeping in):
$ ps -e -w -o pid,pcpu,rss,vsize,state,cmd,wchan
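To narrow that output down to the processes that actually add to the
load while sleeping, something like the following might help. It is
only a sketch: filter_dstate and list_dstate are hypothetical helper
names, and the wchan:32 width specifier assumes the procps version of
ps.

```shell
# filter_dstate: read ps output on stdin and keep the header line plus
# every row whose second column (process state) starts with D, i.e.
# uninterruptible sleep.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# list_dstate: show PID, state, kernel wait channel and command name
# for every process currently in state D.
list_dstate() {
    ps -e -o pid,state,wchan:32,comm | filter_dstate
}
```

Running list_dstate a few times in a row shows whether the same
process keeps turning up in state D, and which kernel function it is
waiting in.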

Whether it's harmless or not really depends on whether it affects your
jobs... If it doesn't, it's harmless!

> We check the compute nodes' CPU load in order to identify badly behaving
> jobs, so even "harmless" CPU loads disturb our monitoring.

If your nodes have a higher baseline load, why not just increase your
alerting threshold?
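As a rough sketch of what I mean, the check could add the known
baseline to the number of cores before comparing. The function names,
the baseline and the margin are all made up for illustration; pick
values that match what you measure on an idle node.

```shell
# load_exceeds LOAD LIMIT: succeed (exit 0) if LOAD is strictly greater
# than LIMIT. awk does the comparison because shell arithmetic is
# integer-only.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load > limit) }'
}

# check_load BASELINE MARGIN: hypothetical wrapper that reads the
# 1-minute load average and complains only when it is above
# cores + BASELINE + MARGIN.
check_load() {
    cores=$(getconf _NPROCESSORS_ONLN)
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    limit=$(awk -v c="$cores" -v b="$1" -v m="$2" 'BEGIN { print c + b + m }')
    if load_exceeds "$load" "$limit"; then
        echo "load $load exceeds threshold $limit"
        return 1
    fi
}
```

For a node that idles at a load of about 1.0 because of the IPMI
driver, "check_load 1.0 0.5" would only flag loads more than 1.5 above
the core count.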

If you don't want any additional load from the IPMI modules, then you
can do IPMI-over-LAN from a central monitoring host (assuming you have
configured the BMC's LAN interface and it is connected to the
network). This communicates directly with the BMC and thus has no
effect on the compute node. You can normally configure the IPMI LAN
interface with the ipmitool command:
# ipmitool lan print
...
# ipmitool lan set 2 ipaddr 192.168.1.1

See ipmitool(1) for the details.
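Once the LAN interface is up, reading the SEL from the monitoring host
might look like the sketch below. It assumes the BMC accepts IPMI v2.0
("lanplus") sessions, that the user name is in $IPMI_USER, and that the
password is exported as IPMI_PASSWORD (which is what -E tells ipmitool
to read). remote_sel is a hypothetical helper name.

```shell
# remote_sel HOST: print the System Event Log of the BMC at HOST,
# queried over the network rather than through the local IPMI driver.
# Assumes IPMI_USER and IPMI_PASSWORD are set in the environment.
remote_sel() {
    ipmitool -I lanplus -H "$1" -U "$IPMI_USER" -E sel elist
}
```

For example, "remote_sel 192.168.1.1" would query the BMC address set
above, with nothing running on the compute node itself.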

I think it's also worth checking the IPMI SDR in case something
doesn't turn up in the SEL:
# ipmitool sdr list
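For a health check you probably only care about sensors that are not
OK. A minimal filter, assuming the usual three-column
"name | reading | status" output and treating "ok" and "ns" (no
reading) as uninteresting, could be the following; sdr_problems is a
hypothetical name.

```shell
# sdr_problems: read "ipmitool sdr list" output on stdin and print only
# the sensors whose status column is neither "ok" nor "ns".
sdr_problems() {
    awk -F '|' 'NF >= 3 {
        status = $3
        gsub(/[ \t]/, "", status)   # strip padding around the status
        if (status != "ok" && status != "ns") print
    }'
}
```

Use it as "ipmitool sdr list | sdr_problems"; empty output means no
sensor reported a problem.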

Cheers
-- 
Jonathan Barber <jonathan.barber at gmail.com>

