[torqueusers] NHC check for mcelog errors

Michael Jennings mej at lbl.gov
Wed Jan 2 18:31:28 MST 2013


On Wednesday, 02 January 2013, at 15:48:49 (+0100),
Ole Holm Nielsen wrote:

> We've been running NHC (SVN version) for a month now, and a few days
> ago one node had a Machine Check Exception (MCE) event which was
> reported in the NHC logfile as expected:
> 
> Running check:  "check_hw_mcelog"
> check_hw_mcelog():  MCEs detected:  Memory errors
> SOCKET 1 CHANNEL any DIMM any
> corrected memory errors:
>         1 total
>         0 in 24h
> uncorrected memory errors:
>         0 total
>         0 in 24h
> 
> SOCKET 1 CHANNEL 2 DIMM any
> corrected memory errors:
>         1 total
>         1 in 24h
> uncorrected memory errors:
>         0 total
>         0 in 24h
> Health check failed:  MCEs detected in log.
> 20121231 12:03:01 /usr/libexec/nhc/node-mark-offline g032 MCEs detected in log.
> /usr/libexec/nhc/node-mark-offline:  Marking job-exclusive g032 offline:  NHC: MCEs detected in log.
> 
> This was probably a transient memory hardware error, but it seems
> that the NHC check_hw_mcelog() is doing its job correctly.

Yay!  Thanks for the feedback. :)

Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

