[torqueusers] downing a node via qmgr

Chris Samuel csamuel at vpac.org
Wed Sep 21 17:11:09 MDT 2005

On Thu, 22 Sep 2005 02:22 am, Stewart.Samuels at sanofi-aventis.com wrote:

> We currently have a node which is rebooting itself constantly.

I would strongly suggest that you do not start the pbs_mom automatically on a 
reboot via init scripts.
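
As a rough sketch of what that could look like on a Red Hat-style system: the 
init script name "pbs_mom" and the use of chkconfig/service are assumptions on 
my part, so adjust them for your distribution.  The run helper is just a 
hypothetical dry-run wrapper that prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch only: DRY_RUN defaults to 1, so commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumption: the mom's init script is installed as "pbs_mom".
run chkconfig pbs_mom off    # do not start the mom automatically at boot
run service pbs_mom start    # later, by hand, once the node looks healthy
```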

If you've rebooted the node yourself then you can restart the mom by hand, 
whereas if the node dies and reboots on its own you're probably going to want 
to investigate first.  We do this, and only restart the mom when we've got a 
better handle on things and think it is safe to do so.

I believe that almost all of our node reboots have been due to hardware 
problems; software issues seem to manifest as hangs instead.  I guess the 
exception to that may be if you're using a watchdog of some sort to reboot 
a node that wedges itself.

> To take the system out of the cluster to diagnose the problem, I have
> specified the following command:
>         qmgr -c 's n node-name state=down'

In this situation you really want to mark the node offline instead.

My understanding is that "offline" is an administrative status, whereas 
"down" is determined by the pbs_server and is transient, depending on what's 
going on with the communication between the server and the mom.
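
Something along these lines should do it with pbsnodes.  Treat it as a sketch: 
"node-name" is the placeholder from your command, and the run helper is a 
hypothetical dry-run wrapper that only prints the commands unless DRY_RUN=0 
is set.

```shell
# Sketch only: DRY_RUN defaults to 1, so commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

NODE="node-name"   # placeholder, as in the quoted qmgr command

run pbsnodes -o "$NODE"   # mark the node offline so no new jobs land on it
run pbsnodes -l           # list the nodes currently flagged down or offline
run pbsnodes -c "$NODE"   # later: clear the offline flag once it is fixed
```

The offline flag stays set until you clear it, so it should not flip back 
just because the mom starts talking to the server again after a reboot.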

Good luck!
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
