[torqueusers] Sharing your Compute Node Health Check scripts

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Mon Aug 23 09:13:13 MDT 2010


hi ole,

attached is a script that we use locally on a few clusters.
it is fairly simple: it tests for working Myrinet and InfiniBand
interfaces and for "machine check errors". it would be pretty
straightforward to extend it with other checks.

cheers,
   axel.

On Mon, Aug 23, 2010 at 11:00 AM, Ole Holm Nielsen
<Ole.H.Nielsen at fysik.dtu.dk> wrote:
> Can anyone implementing a node_check_script kindly share their know-how
> with us and/or the list ?
>
> We would like to implement the Torque Compute Node Health Check script feature in
> http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml
>
> How do people check for health problems such as:
> * various ways that disk failures can cripple a file system or a swap partition
> * RAM memory errors
> * out-of-memory conditions
> * disk full conditions
> * other stuff?
>
> I've asked the list about this before, but received zero responses :-(
> I hope for better luck this time...
>
> Thanks a lot,
> Ole
>
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark

-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: health_check.sh
Type: application/x-sh
Size: 1970 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20100823/2c6d3f3d/attachment.sh 
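
Since the attachment was scrubbed from the archive, here is a minimal
sketch of what such a script could look like. It is not the attached
health_check.sh. It covers one check from the message above (machine
check errors) and one from the quoted question (disk full). The function
names, the 95% limit, the mount points and the log path
/var/log/messages are all assumptions made for illustration. It relies
on the convention, as I read the Torque health check documentation
linked above, that a problem is reported by printing a line that starts
with ERROR.

```shell
#!/bin/sh
# Sketch of a node health check; NOT the scrubbed attachment.
# Assumed defaults (hypothetical, adjust per site):
DISK_LIMIT=${DISK_LIMIT:-95}            # max percent used before complaining
MCE_LOG=${MCE_LOG:-/var/log/messages}   # log file the kernel messages end up in

# Print an ERROR line if the file system holding $1 is more than $2 percent full.
check_disk() {
    used=$(df -P "$1" 2>/dev/null | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    case $used in
        ''|*[!0-9]*)
            echo "ERROR cannot read disk usage for $1"
            return 1
            ;;
    esac
    if [ "$used" -gt "$2" ]; then
        echo "ERROR $1 is ${used}% full"
        return 1
    fi
    return 0
}

# Print an ERROR line if the readable log file $1 mentions a machine check.
check_mce() {
    if [ -r "$1" ] && grep -qi 'machine check' "$1"; then
        echo "ERROR machine check reported in $1"
        return 1
    fi
    return 0
}

# Run the checks in order and stop at the first one that fails,
# so at most one ERROR line is printed.
run_checks() {
    check_disk / "$DISK_LIMIT" &&
    check_disk /tmp "$DISK_LIMIT" &&
    check_mce "$MCE_LOG"
}

# The result is reported on stdout; the exit status is deliberately ignored here.
run_checks || :
```

Further checks (swap, interconnect, memory) would be added as more
functions of the same shape and chained into run_checks.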

