[torqueusers] Sharing your Compute Node Health Check scripts
Brock Palen
brockp at umich.edu
Mon Aug 23 09:29:27 MDT 2010
We check:
* Check that the needed mounts are present
* Check whether /tmp is more than 90% full
* Check for read-only file systems in /proc/mounts (/etc/mtab isn't updated if it lives on the bad filesystem)
* Check for a valid IB connection
* Check that sshd is running (not killed by the OOM killer)
* Check for OOM messages in dmesg
* Check that the ethernet interface is running at 1000 Mbps, full duplex
* Check the amount of locked memory (for IB)
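
For anyone who wants a starting point, here is a rough Python sketch of the first three checks in the list above (mounts, /tmp usage, read-only file systems). It is not our production script. REQUIRED_MOUNTS, the function names and the message wording are made up for illustration. It also assumes the convention from the Torque health check docs that the script reports a problem by printing a line beginning with ERROR, so verify that against your Torque version.

```python
#!/usr/bin/env python3
"""Illustrative node health check sketch; prints one ERROR line on failure."""
import shutil

REQUIRED_MOUNTS = ["/home", "/scratch"]  # hypothetical, site-specific
TMP_LIMIT = 0.90                         # "/tmp above 90% full" threshold


def parse_mounts(text):
    """Map mount point -> list of options, from /proc/mounts-style text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            mounts[fields[1]] = fields[3].split(",")
    return mounts


def missing_mounts(mounts, required):
    """Return the required mount points that are not mounted."""
    return [m for m in required if m not in mounts]


def read_only_mounts(mounts, expected_ro=()):
    """Return mount points carrying the 'ro' option, minus expected ones."""
    return sorted(m for m, opts in mounts.items()
                  if "ro" in opts and m not in expected_ro)


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def run_checks(mounts_text, tmp_fraction, required=REQUIRED_MOUNTS,
               expected_ro=()):
    """Return a list of problem descriptions; an empty list means healthy."""
    mounts = parse_mounts(mounts_text)
    problems = [f"missing mount {m}" for m in missing_mounts(mounts, required)]
    problems += [f"read-only filesystem {m}"
                 for m in read_only_mounts(mounts, expected_ro)]
    if tmp_fraction > TMP_LIMIT:
        problems.append(f"/tmp is {tmp_fraction:.0%} full")
    return problems


def main():
    """Run the checks on this node and print an ERROR line if any fail."""
    try:
        with open("/proc/mounts") as fh:
            problems = run_checks(fh.read(), usage_fraction("/tmp"))
    except OSError as exc:
        problems = [f"health check could not run: {exc}"]
    if problems:
        print("ERROR " + "; ".join(problems))


if __name__ == "__main__":
    main()
```

Some mounts are read-only on purpose on many systems, so in practice you would pass those in expected_ro rather than flag them. The IB, sshd, dmesg, ethernet and locked-memory checks are left out because they depend heavily on local tools and hardware.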
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Aug 23, 2010, at 11:00 AM, Ole Holm Nielsen wrote:
> Can anyone implementing a node_check_script kindly share their know-how
> with us and/or the list?
>
> We would like to implement the Torque Compute Node Health Check script feature in
> http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
>
> How do people check for health problems such as:
> * various ways that disk failures can cripple a file system or a swap partition
> * RAM memory errors
> * out-of-memory conditions
> * disk full conditions
> * other stuff?
>
> I've asked the list about this before, but received zero responses :-(
> I hope for better luck this time...
>
> Thanks a lot,
> Ole
>
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark