[torqueusers] Sharing your Compute Node Health Check scripts
akohlmey at cmm.chem.upenn.edu
Mon Aug 23 09:13:13 MDT 2010
Attached is a script that we use locally on a few clusters.
It is fairly simple and tests for working Myrinet and InfiniBand
links and for "machine check errors". It would be pretty
straightforward to extend it with other checks.
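
For readers who cannot open the attachment, here is a minimal sketch of
what checks of this kind could look like. It is not the attached script:
the function names, the log and sysfs paths, and the disk threshold are
all hypothetical and need adjusting per site. It also assumes that
pbs_mom treats output beginning with ERROR as a failed check, which is
my understanding of the node_check_script feature.

```shell
#!/bin/sh
# Sketch of a Torque node health check. Every path, name and threshold
# here is an assumption for illustration, not taken from the attachment.

# Succeeds if the given log file contains machine check messages.
has_mce() {
    [ -r "$1" ] && grep -qi 'machine check' "$1"
}

# Succeeds if the given port state file reports an active link, e.g.
# /sys/class/infiniband/<hca>/ports/<n>/state containing "4: ACTIVE".
ib_port_active() {
    [ -r "$1" ] && grep -q 'ACTIVE' "$1"
}

# Prints the used percentage (digits only) of the file system holding $1.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Runs all checks and prints one line: "ERROR <reasons>" or "OK".
# Arguments: MCE log file, IB port state file, mount point, max used %.
node_health() {
    mce_log=$1; ib_state=$2; mount_point=$3; max_pct=$4
    problems=""
    if has_mce "$mce_log"; then
        problems="$problems machine-check-errors"
    fi
    if ! ib_port_active "$ib_state"; then
        problems="$problems infiniband-down"
    fi
    used=$(disk_used_pct "$mount_point")
    if [ "${used:-0}" -gt "$max_pct" ]; then
        problems="$problems disk-full:$mount_point"
    fi
    if [ -n "$problems" ]; then
        echo "ERROR$problems"
    else
        echo "OK"
    fi
}

# Example call (hypothetical paths):
# node_health /var/log/mcelog /sys/class/infiniband/mlx4_0/ports/1/state /tmp 95
```

A new check is one more small function plus one more "if" in
node_health. Note that a missing state file counts as a down link here,
so the InfiniBand test should be dropped on nodes that have no such
hardware.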
On Mon, Aug 23, 2010 at 11:00 AM, Ole Holm Nielsen
<Ole.H.Nielsen at fysik.dtu.dk> wrote:
> Can anyone implementing a node_check_script kindly share their know-how
> with us and/or the list ?
> We would like to implement the Torque Compute Node Health Check script feature in
> How do people check for health problems such as:
> * various ways that disk failures can cripple a file system or a swap partition
> * RAM memory errors
> * out-of-memory conditions
> * disk full conditions
> * other stuff?
> I've asked the list about this before, but received zero responses :-(
> I hope for better luck this time...
> Thanks a lot,
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark
Dr. Axel Kohlmeyer akohlmey at gmail.com
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 1970 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20100823/2c6d3f3d/attachment.sh