[torqueusers] NHC check for mcelog errors (Christopher Samuel)

Christopher Samuel samuel at unimelb.edu.au
Mon Jan 14 23:10:28 MST 2013



On 09/01/13 19:00, Ole Holm Nielsen wrote:

> Christopher Samuel wrote:
> 
>> We're running RHEL 5 so it doesn't do that.  It's worth checking 
>> though if it's actually consuming CPU or just sitting in a
>> device wait.  If it's just in a device wait then it won't have
>> any impact.
> 
> We see the extra CPU load on both CentOS5 and CentOS6 nodes.

Very odd. We don't see that here on RHEL5.  The mcelog daemon does add
a load of 1 to a node, but it's not using any CPU.

> How does one determine if a CPU load is "harmless"?

"top" is your friend, test on a completely idle node and use 'i' to
tell it to not show idle tasks.  Things that are in 'D' are in a
device wait.

> We're also seeing extra CPU loads of about 0.5 on our new nodes
> with QDR Infiniband adapters, perhaps that's also due to device
> waits?

It is possible.

> We check the compute nodes' CPU load in order to identify badly
> behaving jobs, so even "harmless" CPU loads disturb our
> monitoring.

We use cpusets to restrict jobs to just the cores they use, and
Open-MPI with TM support to make sure MPI jobs only launch where they
are supposed to.  If a job uses more cores than it's meant to inside
its cpuset, it only harms itself.
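For anyone unfamiliar with cpusets, the idea looks roughly like this
using the legacy cpuset filesystem; the mount point, directory name and
core list below are purely illustrative, not our actual Torque layout
(Torque's cpuset support does this for you):

```shell
# Confine a hypothetical job to cores 0-3 (needs root and a mounted
# cpuset filesystem).  Any children of $JOB_PID inherit the cpuset,
# so oversubscribing threads only compete with each other.
mkdir -p /dev/cpuset/torque/job123
echo 0-3 > /dev/cpuset/torque/job123/cpus
echo 0   > /dev/cpuset/torque/job123/mems
echo "$JOB_PID" > /dev/cpuset/torque/job123/tasks
```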

[IPMI]
>> I think that depends on how often you run them, we run ours every
>> 15 minutes and they write to a file in /dev/shm. The check that
>> pbs_mom runs just cats that file (or produces an error if it's
>> not present).
> 
> Interesting!  Perhaps you could share your scripts and setup with
> the Torque community?

Could do, would need to find a good way to put them up though.
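In the meantime the scheme is simple enough to sketch.  This is
illustrative only, not our actual scripts; the cache path and file name
are made up:

```shell
#!/bin/sh
# Sketch: cache IPMI sensor output periodically; the pbs_mom health
# check only ever reads the cache, so jobs never wait on the BMC.
CACHE=${CACHE:-/dev/shm/ipmi-sensors.out}

# Cron side (e.g. */15 * * * *): refresh the cache atomically so the
# health check never reads a half-written file.
refresh_cache() {
    ipmitool sensor > "$CACHE.tmp" && mv "$CACHE.tmp" "$CACHE"
}

# pbs_mom health-check side: report the cached data, or fail fast if
# the cache is missing (i.e. the cron job has stopped running).
check_cache() {
    if [ -r "$CACHE" ]; then
        cat "$CACHE"
    else
        echo "ERROR: IPMI sensor cache missing" >&2
        return 1
    fi
}
```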

> Do you see any job performance impacts due to running IPMI
> commands on busy nodes?

We've not had any complaints, but then we're catering to life sciences
where it's not uncommon to see people running Java, R, Perl and Python
as their HPC codes... :-(

> Yes, but our (older) BMCs log a lot of irrelevant stuff which do
> not warrant the offlining of the nodes.  How would one identify
> genuinely broken hardware from the BMC error logs?

At the moment we just look for errors we've seen in the past, and
check that the log isn't full.  Not perfect, but it's worked OK so
far. :-)
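A rough sketch of that kind of check; the error patterns and the 80%
threshold here are examples for illustration, not our real list:

```shell
# Sketch: flag BMC event-log entries matching patterns previously seen
# to indicate genuinely broken hardware, and warn before the SEL fills
# up and stops logging new events.

parse_pct() {
    # Extract the "Percent Used" value from `ipmitool sel info` output.
    awk -F': *' '/Percent Used/ {sub("%", "", $2); print $2}'
}

check_sel() {
    # Known-bad patterns (example list only).
    if ipmitool sel list 2>/dev/null | grep -Eqi 'uncorrectable|critical'
    then
        echo "ERROR: suspicious BMC event found" >&2
        return 1
    fi
    pct=$(ipmitool sel info 2>/dev/null | parse_pct)
    if [ "${pct:-0}" -ge 80 ]; then
        echo "WARNING: SEL ${pct}% full" >&2
        return 1
    fi
    return 0
}
```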

cheers!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci


