[torqueusers] downing a node via qmgr
csamuel at vpac.org
Wed Sep 21 17:11:09 MDT 2005
On Thu, 22 Sep 2005 02:22 am, Stewart.Samuels at sanofi-aventis.com wrote:
> We currently have a node which is rebooting itself constantly.
I would strongly suggest that you do not start the pbs_mom automatically on a
reboot via init scripts.
If you've rebooted the node yourself then you should restart the mom by hand,
whereas if the node dies and reboots on its own you're probably going to want
to investigate first. We do this, and only restart the mom once we've got a
better handle on things and think it is safe to do so.
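
As a rough sketch of what that might look like, assuming a Red Hat style
system with chkconfig and an init script named pbs_mom (both are assumptions,
so adjust for your distribution and install). The RUN variable and the two
function names are just illustrative; RUN defaults to echo so the commands are
only printed, not executed.

```shell
#!/bin/sh
# Dry run by default: commands are echoed. Set RUN="" to really execute them.
RUN="${RUN-echo}"

# Assumption: the mom's init script is registered with chkconfig as "pbs_mom".
disable_mom_autostart() {
    # Stop the mom from being started automatically at boot.
    $RUN chkconfig pbs_mom off
}

# Assumption: the same init script is reachable through "service".
start_mom_by_hand() {
    # Run this only once you are happy the node is healthy again.
    $RUN service pbs_mom start
}

disable_mom_autostart
```

Once the node has been checked over, start_mom_by_hand brings the mom back.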
I believe that almost all of our node reboots have been due to hardware
problems; software issues seem to manifest as hangs instead. I guess the
exception to that may be if you're using a watchdog of some sort to reboot if
a node wedges itself.
> To take the system out of the cluster to diagnose the problem, I have
> specified the following command:
> qmgr -c 's n node-name state=down'
In this situation you really want to mark the node offline instead.
My understanding is that "offline" is an administrative status, whereas
"down" is determined by the pbs_server and is transient, depending on the
state of communication between server and mom.
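
For example, a minimal sketch using pbsnodes, with node-name as a placeholder
as in your command. The RUN variable and the function names are my own for
illustration; RUN defaults to echo so nothing is changed until you set it to
an empty string.

```shell
#!/bin/sh
# Dry run by default: commands are echoed. Set RUN="" to really execute them.
RUN="${RUN-echo}"
NODE="${NODE:-node-name}"   # placeholder node name

offline_node() {
    # Mark the node offline (administrative state, so it stays set).
    $RUN pbsnodes -o "$1"
}

online_node() {
    # Clear the offline mark once the node has been fixed.
    $RUN pbsnodes -c "$1"
}

list_unavailable() {
    # List nodes currently marked down, offline or unknown.
    $RUN pbsnodes -l
}

offline_node "$NODE"
```

The offline mark should stay in place until you clear it yourself with
online_node, which is what you want while you are diagnosing the hardware.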
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia