[torqueusers] NHC check for mcelog errors

Michael Jennings mej at lbl.gov
Thu Nov 29 19:03:13 MST 2012

On Wednesday, 28 November 2012, at 12:08:03 (+0100),
Ole Holm Nielsen wrote:

> That sounds bad. We don't want to interfere with the usual MCE
> logging.  It seems silly for the mcelog daemon to wipe its internal
> log data every time it's inquired - are you sure the daemon behaves
> in that way?

I'm just going by what's in the man page.  I don't currently have any
systems getting MCEs with which I can test the behavior.  :-)

> I'm also confused about the mcelogd daemon logging to
> /var/log/mcelog constantly, as this would seem to conflict with the
> hourly cron job?

Me too.  I think someone with a test candidate will have to see what
happens with this and report back.

Any takers?  :-)

> This is no good either.  It seems we need a different approach where
> persistent MCE errors should cause the node to be offlined until
> next reboot.  Anyway, an NHC check script looking at MCE events
> would be optional for sites to experiment with.

I've added check_hw_mcelog to SVN.  Feel free to download and give it
a try.  The SVN repository is at:  https://warewulf.lbl.gov/svn/trunk/nhc
Packages can be easily built from SVN via:
  $ ./autogen.sh && make distcheck && rpmbuild -ta warewulf-nhc-*.tar.gz

The check is pretty simple at the moment.  If mcelog --client prints
anything other than "Connection refused" (which means no daemon is
running), the check fails.  Otherwise, it passes.
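
For anyone who wants to see that rule spelled out, here is a rough
sketch of the pass/fail logic.  It is not the check_hw_mcelog code in
SVN: the function name mce_check_sketch and the MCELOG_CMD override
are hypothetical, added only so the logic can be tried on a machine
without mcelog.  Capturing stderr along with stdout is also an
assumption, in case the "Connection refused" message is printed there.

```shell
#!/bin/bash
# Illustrative sketch only; not the check_hw_mcelog shipped in SVN.

mce_check_sketch() {
    # MCELOG_CMD is a hypothetical override for testing; the default
    # is the real client query described above.
    local cmd="${MCELOG_CMD:-mcelog --client}"
    local output

    # Assumption: capture stdout and stderr together so that a
    # connection error is seen no matter where it is printed.
    output=$($cmd 2>&1)

    # No output at all: nothing reported, so the check passes.
    if [[ -z "$output" ]]; then
        return 0
    fi

    # "Connection refused" means no daemon is running; that is
    # treated as a pass, not as a hardware error.
    if [[ "$output" == *"Connection refused"* ]]; then
        return 0
    fi

    # Anything else counts as a reported MCE, so the check fails.
    echo "mce_check_sketch: mcelog reported: $output" >&2
    return 1
}
```

Calling mce_check_sketch returns 0 for a pass and 1 for a fail, so
it can be dropped into an "if" or followed by "|| echo FAILED" when
experimenting.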

Suggestions, additions, etc. are welcome!


