[torqueusers] NHC check for mcelog errors
Michael Jennings
mej at lbl.gov
Wed Jan 2 18:31:28 MST 2013
On Wednesday, 02 January 2013, at 15:48:49 (+0100),
Ole Holm Nielsen wrote:
> We've been running NHC (SVN version) for a month now, and a few days
> ago one node had a Machine Check Exception (MCE) event which was
> reported in the NHC logfile as expected:
>
> Running check: "check_hw_mcelog"
> check_hw_mcelog(): MCEs detected: Memory errors
> SOCKET 1 CHANNEL any DIMM any
> corrected memory errors:
> 1 total
> 0 in 24h
> uncorrected memory errors:
> 0 total
> 0 in 24h
>
> SOCKET 1 CHANNEL 2 DIMM any
> corrected memory errors:
> 1 total
> 1 in 24h
> uncorrected memory errors:
> 0 total
> 0 in 24h
> Health check failed: MCEs detected in log.
> 20121231 12:03:01 /usr/libexec/nhc/node-mark-offline g032 MCEs
> detected in log.
> /usr/libexec/nhc/node-mark-offline: Marking job-exclusive g032
> offline: NHC: MCEs detected in log.
>
> This was probably a transient memory hardware error, but it seems
> that the NHC check_hw_mcelog() is doing its job correctly.
Yay! Thanks for the feedback. :)
Michael
--
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
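
The per-socket summary quoted in the report above has a regular shape, so it can be turned into structured records for site-side scripts. What follows is a minimal sketch, not part of NHC and not how check_hw_mcelog() itself works. The function name parse_mce_summary, the dictionary keys, and the assumption that every report follows exactly the layout of the quoted sample are all hypothetical.

```python
import re

# Sample text copied from the report quoted in the message above.
SAMPLE = """\
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
1 total
0 in 24h
uncorrected memory errors:
0 total
0 in 24h

SOCKET 1 CHANNEL 2 DIMM any
corrected memory errors:
1 total
1 in 24h
uncorrected memory errors:
0 total
0 in 24h
"""


def parse_mce_summary(text):
    """Return one dict per 'SOCKET ... CHANNEL ... DIMM ...' block (hypothetical helper)."""
    records = []
    current = None   # record being filled in
    kind = None      # "corrected" or "uncorrected"
    for raw in text.splitlines():
        # Tolerate email quoting ("> ") in front of each line.
        line = raw.strip().lstrip(">").strip()
        m = re.match(r"SOCKET (\S+) CHANNEL (\S+) DIMM (\S+)$", line)
        if m:
            current = {"socket": m.group(1), "channel": m.group(2), "dimm": m.group(3)}
            records.append(current)
            kind = None
            continue
        if current is None:
            continue  # ignore anything before the first SOCKET line
        m = re.match(r"(corrected|uncorrected) memory errors:$", line)
        if m:
            kind = m.group(1)
            continue
        m = re.match(r"(\d+) (total|in 24h)$", line)
        if m and kind:
            suffix = "_total" if m.group(2) == "total" else "_24h"
            current[kind + suffix] = int(m.group(1))
    return records


records = parse_mce_summary(SAMPLE)
# Keep only locations with errors in the last 24 hours.
recent = [r for r in records
          if r.get("corrected_24h", 0) or r.get("uncorrected_24h", 0)]
```

With the quoted sample, only the "SOCKET 1 CHANNEL 2" entry has a nonzero 24-hour count, which matches the reading that this was a single recent corrected error.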