[torqueusers] NHC check for mcelog errors
Michael Jennings
mej at lbl.gov
Thu Nov 29 19:03:13 MST 2012
On Wednesday, 28 November 2012, at 12:08:03 (+0100),
Ole Holm Nielsen wrote:
> That sounds bad. We don't want to interfere with the usual MCE
> logging. It seems silly for the mcelog daemon to wipe its internal
> log data every time it's queried - are you sure the daemon behaves
> in that way?
I'm just going by what's in the man page. I don't currently have any
systems getting MCEs with which I can test the behavior. :-)
> I'm also confused about the mcelogd daemon logging to
> /var/log/mcelog constantly, as this would seem to conflict with the
> hourly cron job?
Me too. I think someone with a test candidate will have to see what
happens with this and report back.
Any takers? :-)
> This is no good either. It seems we need a different approach where
> persistent MCE errors should cause the node to be offlined until
> next reboot. Anyway, an NHC check script looking at MCE events
> would be optional for sites to experiment with.
I've added check_hw_mcelog to SVN. Feel free to download and give it
a try. The SVN repository is at: https://warewulf.lbl.gov/svn/trunk/nhc
Packages can be easily built from SVN via:
$ ./autogen.sh && make distcheck && rpmbuild -ta warewulf-nhc-*.tar.gz
The check is pretty simple at the moment. If mcelog --client returns
anything (other than "Connection refused," meaning no daemon is
running), the check fails. Otherwise, it passes.
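For illustration, the pass/fail rule described above could be sketched
roughly as follows. This is not the actual check_hw_mcelog code from
SVN; the function name classify_mcelog_output is hypothetical, and the
sketch assumes that "returns anything" refers to text printed by
mcelog --client and that the "Connection refused" message appears
somewhere in that text.

```shell
#!/bin/bash
# Hypothetical sketch of the rule described above; NOT the real
# check_hw_mcelog implementation.
#
# classify_mcelog_output OUTPUT
#   returns 0 (pass) if OUTPUT is empty or mentions "Connection refused"
#   returns 1 (fail) if OUTPUT contains anything else
classify_mcelog_output() {
    local output="$1"

    # No output at all: nothing was reported, so pass.
    if [[ -z "$output" ]]; then
        return 0
    fi

    # Assumption: this message means no daemon is running, which the
    # check treats as a pass rather than a failure.
    if [[ "$output" == *"Connection refused"* ]]; then
        return 0
    fi

    # Any other output is taken to mean MCEs were reported.
    return 1
}
```

On a node with mcelog installed, it might be called like this:

    classify_mcelog_output "$(mcelog --client 2>&1)" || echo "MCE check failed"

Keeping the decision in a function that takes the text as an argument
makes it possible to try the logic on systems that are not currently
logging any MCEs.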
Suggestions, additions, etc. are welcome!
Michael
--
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615