[torqueusers] Sharing your Compute Node Health Check scripts

Wickliffe, Blake W blake.wickliffe at aramco.com
Mon Aug 23 22:39:57 MDT 2010


Hi,

We use the nodeCheck.sh script I've attached.  It contains all the logic for running the tests, but the tests and their data are kept in a separate text file (I've attached a sample, nodeCheck.txt), so it's extensible.

As you can see in the data file, you can define any number of short, one-line tests, each of which you assign a variable name.  From the example:

<Keyword>,  <Variable Name>, <Variable Test>
#NodeCheckCmd, cvfs_fs, mount -t cvfs | wc -l

If you want to test something that takes a bit more scripting, you use a different keyword and point it at a separate script:
#NodeCheckScript, var_full, ./SCRIPTS/CheckVarFull.sh
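
A minimal sketch of how definition lines like the two above could be read and evaluated.  This is not the attached nodeCheck.sh; the function names (trim, run_definitions) are made up for illustration, and it assumes that both keywords can be handled the same way, by running the third field through the shell and taking its output as the variable's value.

```shell
#!/bin/sh
# Sketch only: evaluate "#NodeCheckCmd" / "#NodeCheckScript" lines from a
# data file and print "name=result" for each. Function names are hypothetical.

# trim STRING: strip leading and trailing whitespace
trim() {
    printf '%s\n' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# run_definitions FILE: run every definition line found in FILE
run_definitions() {
    while IFS= read -r line; do
        case "$line" in
            '#NodeCheckCmd,'*|'#NodeCheckScript,'*) ;;  # a definition line
            *) continue ;;                               # anything else is skipped
        esac
        # Split on the first two commas only, so the test may contain commas.
        rest=${line#*,}
        name=$(trim "${rest%%,*}")
        test_cmd=$(trim "${rest#*,}")
        # Assumption: the third field is a shell command (or a script path)
        # whose trimmed stdout is the value. stdin is redirected so the test
        # cannot swallow the rest of the data file.
        result=$(trim "$(sh -c "$test_cmd" </dev/null 2>/dev/null)")
        printf '%s=%s\n' "$name" "$result"
    done < "$1"
}
```

Calling "run_definitions nodeCheck.txt" would then print one name=value pair per defined test.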


Finally, you define which tests to run on each node (all in the same file) and what the correct answers should be, e.g.:

node001: cvfs_fs=0 nfs_fs=9 num_cpus=8 tmp_fs_rw=$TIMESTAMP var_full=0 eth_errors=0
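
To illustrate the comparison step, here is a rough sketch that checks a node line like the one above against a set of measured name=value pairs.  Again, this is not the attached script: check_node is a made-up name, the format of the "Summary" line is an assumption, and special values such as $TIMESTAMP are not handled (they are compared literally).

```shell
#!/bin/sh
# Sketch only: compare expected values from a node line with measured values.
#
# check_node NODE_LINE RESULTS
#   NODE_LINE: e.g. "node001: cvfs_fs=0 num_cpus=8"
#   RESULTS:   newline-separated "name=value" pairs of measured values
# Prints one FAIL line per mismatch, then a "Summary" line (hypothetical
# format); returns 0 only if every test matched.
check_node() {
    node=${1%%:*}
    expected=${1#*:}
    failures=0
    # Unquoted on purpose: split the expected list on whitespace.
    for pair in $expected; do
        name=${pair%%=*}
        want=${pair#*=}
        # Look up the measured value for this variable name.
        got=$(printf '%s\n' "$2" | sed -n "s/^${name}=//p" | head -n 1)
        if [ "$got" != "$want" ]; then
            printf 'FAIL %s: %s expected "%s", got "%s"\n' \
                "$node" "$name" "$want" "$got"
            failures=$((failures + 1))
        fi
    done
    printf 'Summary %s: %d failed\n' "$node" "$failures"
    [ "$failures" -eq 0 ]
}
```

A test listed on the node line but missing from the results counts as a failure, since its measured value comes back empty.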

We keep this script on a shared filesystem, and you just run it without any arguments.  For the Torque functionality, you just grep its output for the word "Summary".


Hope that's useful!

Blake Wickliffe
Saudi Aramco
ENOD/CSYS/USG HPC Team
(873-4417)


-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ole Holm Nielsen
Sent: Monday, August 23, 2010 6:00 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Sharing your Compute Node Health Check scripts

Can anyone implementing a node_check_script kindly share their know-how
with us and/or the list ?

We would like to implement the Torque Compute Node Health Check script feature in
http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml

How do people check for health problems such as:
* various ways that disk failures can cripple a file system or a swap partition
* RAM memory errors
* out-of-memory conditions
* disk full conditions
* other stuff?

I've asked the list about this before, but received zero responses :-(
I hope for better luck this time...

Thanks a lot,
Ole

Ole Holm Nielsen
Department of Physics, Technical University of Denmark

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nodeCheck.sh
Type: application/octet-stream
Size: 2430 bytes
Desc: nodeCheck.sh
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20100824/4cb18017/attachment.obj 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nodeCheck.txt
Url: http://www.supercluster.org/pipermail/torqueusers/attachments/20100824/4cb18017/attachment.txt 

