[torquedev] node_check_script

Gareth.Williams at csiro.au
Thu Feb 5 03:52:01 MST 2009


Hi torqueusers and torquedev'rs,

We recently set up a node_check_script (http://www.clusterresources.com/torquedocs21/10.2healthcheck.shtml), and it has already helped a number of times by preventing jobs from going to broken or misconfigured nodes. However, there seems to be a bug in the logic (and the documentation) for when the script runs.

I initially set node_check_interval to 1, but 'momctl -d' reported (incorrectly?) that the interval between checks was 1 second! I was alarmed enough to change the value to 180, but later observed that the script did not seem to be running much (nodes were not flagged as okay after they were fixed). The documentation also states that node_check_interval is a multiplier for the MOM interval (which we have left at the default 45 s), so 180 would give 180 x 45 s = 8100 s, or 2 hrs 15 min.

There was no evidence in the logs of when the script ran, so I modified our script to record each time it ran, and found curious results. There seem to be two distinct patterns of how often the script runs. Since I changed the script 31.5 hrs ago, most nodes have run it either exactly 40 or exactly 656 times. The 40 run times are not at regular intervals, but they do correspond between the nodes I checked. Here are some pasted values:
Thu Feb  5 11:09:12 EST 2009    Thu Feb  5 11:09:12 EST 2009
Thu Feb  5 11:11:11 EST 2009    Thu Feb  5 11:11:11 EST 2009
Thu Feb  5 12:06:29 EST 2009    Thu Feb  5 12:06:45 EST 2009
Thu Feb  5 12:48:05 EST 2009    Thu Feb  5 12:48:05 EST 2009
Thu Feb  5 14:00:11 EST 2009    Thu Feb  5 14:00:11 EST 2009
Thu Feb  5 14:21:29 EST 2009    Thu Feb  5 14:21:46 EST 2009
Thu Feb  5 16:36:29 EST 2009    Thu Feb  5 16:36:48 EST 2009
Thu Feb  5 17:07:12 EST 2009    Thu Feb  5 17:07:11 EST 2009
Thu Feb  5 18:51:29 EST 2009    Thu Feb  5 18:51:48 EST 2009
Thu Feb  5 21:06:29 EST 2009    Thu Feb  5 21:06:48 EST 2009
The 656 run times, by contrast, are at regular 3-minute (180 s) intervals.
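
As a rough check of the arithmetic, the sketch below compares the two possible readings of node_check_interval = 180 (a multiplier of the 45 s MOM interval, or plain seconds) and computes the gaps between the first-column run times pasted above. The variable and function names are illustrative only. Five of the ten pasted times turn out to lie exactly 8100 s apart, which suggests (but does not prove) that the 40-run nodes use the multiplied interval and also run the check at some additional, irregular times.

```python
from datetime import datetime

MOM_INTERVAL_S = 45        # default MOM interval quoted above
NODE_CHECK_INTERVAL = 180  # value we set for node_check_interval

# Reading A: the value multiplies the MOM interval (as documented).
multiplied_s = NODE_CHECK_INTERVAL * MOM_INTERVAL_S
# Reading B: the value is taken as plain seconds.
literal_s = NODE_CHECK_INTERVAL

# Runs each reading would predict over the 31.5 hrs of logging.
elapsed_s = 31.5 * 3600
expected_runs_multiplied = elapsed_s / multiplied_s  # compare with 40 observed
expected_runs_literal = elapsed_s / literal_s        # compare with 656 observed

# First-column run times from the pasted values (all on Thu Feb 5).
stamps = [
    "11:09:12", "11:11:11", "12:06:29", "12:48:05", "14:00:11",
    "14:21:29", "16:36:29", "17:07:12", "18:51:29", "21:06:29",
]
times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]

# Gap in seconds between each pair of consecutive runs.
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def on_grid(stamps, times, step_s):
    """Return the stamps that have another stamp exactly step_s before or after."""
    seconds = [int((t - times[0]).total_seconds()) for t in times]
    seen = set(seconds)
    return [
        s for s, sec in zip(stamps, seconds)
        if sec + step_s in seen or sec - step_s in seen
    ]


regular = on_grid(stamps, times, multiplied_s)

print("multiplied interval:", multiplied_s, "s; literal interval:", literal_s, "s")
print("expected runs:", expected_runs_multiplied, "vs", expected_runs_literal)
print("gaps between runs (s):", gaps)
print("runs spaced by the multiplied interval:", regular)
```

The predicted counts (14 and 630) do not match the observed 40 and 656 exactly; the 31.5 hrs figure is approximate, and the extra runs on the 40-run nodes are the irregular ones.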

Does anybody else observe similar behaviour?
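
For anyone who wants to check their own nodes, here is a minimal sketch of the kind of run-time logging described above. It is not our actual script: the function names are hypothetical and the log path is whatever you choose to pass in. Calling record_run() once from the health check script on each node leaves one timestamp per run, and count_runs() gives the total to compare between nodes.

```python
import time
from pathlib import Path


def record_run(log_path):
    """Append the current local time to log_path, one line per run."""
    with open(log_path, "a") as log:
        log.write(time.strftime("%a %b %d %H:%M:%S %Z %Y") + "\n")


def count_runs(log_path):
    """Return the number of recorded runs (non-empty lines) in log_path."""
    path = Path(log_path)
    if not path.exists():
        return 0
    return sum(1 for line in path.read_text().splitlines() if line.strip())
```

Keeping the log on node-local disk rather than a shared filesystem means the record still gets written when the shared filesystem is the thing that is broken.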

What might be going on and where might I look for the source of the difference?

What is the intended behaviour and does the documentation (or code) need fixing?

We're using TORQUE 2.3.6 with a mix of 32- and 64-bit Intel nodes running SLES10 and a mix of memory sizes, but the difference in node_check behaviour does not follow these node differences in any consistent way.

Regards,

Gareth Williams, CSIRO

