[torqueusers] Sharing your Compute Node Health Check scripts

Brock Palen brockp at umich.edu
Mon Aug 23 09:29:27 MDT 2010


We check:

* Check that the needed mounts are there
* Check whether /tmp is more than 90% full
* Check for read-only file systems in /proc/mounts (/etc/mtab isn't updated if it lives on the bad file system)
* Check for a valid IB connection
* Check that sshd is running (not killed by the OOM killer)
* Check for OOM messages in dmesg
* Check that the Ethernet interface is running at 1000 Mbps, full duplex
* Check the amount of locked memory (for IB)
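
To illustrate the mount, read-only and /tmp checks from the list above, here is a minimal Python sketch. It is not the script described in this message. The function names, the REQUIRED_MOUNTS list and the TMP_LIMIT constant are hypothetical and chosen only for illustration. It assumes the Torque convention that a health check script signals a problem by printing a line that starts with ERROR.

```python
import shutil

# Hypothetical site settings, adjust for your cluster.
REQUIRED_MOUNTS = ("/home",)
TMP_LIMIT = 90.0  # percent


def parse_mounts(mounts_text):
    """Yield (mount_point, options) pairs from /proc/mounts style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # Fields: device, mount point, fs type, comma-separated options.
            yield fields[1], fields[3].split(",")


def missing_mounts(mounts_text, required):
    """Return the required mount points that are not currently mounted."""
    mounted = {mount_point for mount_point, _ in parse_mounts(mounts_text)}
    return [m for m in required if m not in mounted]


def read_only_mounts(mounts_text, ignore=()):
    """Return mount points carrying the 'ro' option, except those in ignore."""
    return [
        mount_point
        for mount_point, options in parse_mounts(mounts_text)
        if "ro" in options and mount_point not in ignore
    ]


def percent_used(path):
    """Return how full the file system holding path is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def run_checks(mounts_text, required, tmp_percent, tmp_limit=TMP_LIMIT):
    """Collect a human-readable message for every failed check."""
    problems = []
    for m in missing_mounts(mounts_text, required):
        problems.append("missing mount %s" % m)
    for m in read_only_mounts(mounts_text):
        problems.append("read-only file system %s" % m)
    if tmp_percent > tmp_limit:
        problems.append("/tmp is %.0f%% full" % tmp_percent)
    return problems


def main():
    try:
        # Read /proc/mounts rather than /etc/mtab, which can be stale.
        with open("/proc/mounts") as f:
            mounts_text = f.read()
        tmp_percent = percent_used("/tmp")
    except OSError as exc:
        print("ERROR health check could not read system state: %s" % exc)
        return
    problems = run_checks(mounts_text, REQUIRED_MOUNTS, tmp_percent)
    if problems:
        # Assumed convention: a line starting with ERROR marks the node unhealthy.
        print("ERROR " + "; ".join(problems))


if __name__ == "__main__":
    main()
```

Some file systems are legitimately mounted read-only on a healthy node, so in practice the ignore argument of read_only_mounts would need a site-specific list. The IB, sshd, dmesg, Ethernet and locked-memory checks are left out of this sketch.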

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Aug 23, 2010, at 11:00 AM, Ole Holm Nielsen wrote:

> Can anyone implementing a node_check_script kindly share their know-how
> with us and/or the list ?
> 
> We would like to implement the Torque Compute Node Health Check script feature in
> http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
> 
> How do people check for health problems such as:
> * various ways that disk failures can cripple a file system or a swap partition
> * RAM memory errors
> * out-of-memory conditions
> * disk full conditions
> * other stuff?
> 
> I've asked the list about this before, but received zero responses :-(
> I hope for better luck this time...
> 
> Thanks a lot,
> Ole
> 
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark


