[torqueusers] downing a node via qmgr

Robin Humble rjh at cita.utoronto.ca
Thu Sep 22 16:10:59 MDT 2005


On Thu, Sep 22, 2005 at 01:03:49PM -0700, Garrick Staples wrote:
>On Thu, Sep 22, 2005 at 09:11:09AM +1000, Chris Samuel alleged:
>> On Thu, 22 Sep 2005 02:22 am, Stewart.Samuels at sanofi-aventis.com wrote:
>> > We currently have a node which is rebooting itself constantly.
>> I would strongly suggest that you do not start the pbs_mom automatically on a 
>> reboot via init scripts.
>> 
>> If you've rebooted the node yourself then you should restart it by hand 
>> whereas if the node dies and reboots you're probably going to want to 
>> investigate.  We do this, and only restart the mom when we've got a better 
>> handle on things and think it safe to do so.

<aol>us too</aol>

I like a node to stay offline whilst we see why it rebooted, or
whether the re-install worked, etc. (although generally we offline nodes
before a re-install anyway)
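
For anyone wanting to script the offlining step, here is a minimal sketch,
not our actual tooling. It relies only on pbsnodes -o (mark a node offline)
and pbsnodes -c (clear the offline state). The helper names, the dry_run
switch and the node name node042 are made up for illustration.

```python
import subprocess


def pbsnodes_cmd(node, offline):
    """Build the pbsnodes argv: -o marks the node offline, -c clears it."""
    flag = "-o" if offline else "-c"
    return ["pbsnodes", flag, node]


def set_offline(node, offline=True, dry_run=True):
    """Mark a node offline (or clear it).

    With dry_run=True (the default, a hypothetical safety switch) the
    command is only printed, so this can be tried away from the server.
    """
    cmd = pbsnodes_cmd(node, offline)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Needs to run somewhere with operator/manager rights on pbs_server.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    set_offline("node042")                 # before the re-install
    set_offline("node042", offline=False)  # once the node looks healthy
```

With the default dry_run=True nothing is executed; the two calls at the
bottom just print the commands they would run.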

>Really?  That's what you guys do on your cluster?  That sounds like a
>major hassle.

is there a better way?

we also have to wait up to 5 mins for info to flush in via ganglia
before we can restart our home-grown distributed filesystem, and we
don't want any jobs running on the node 'til then.
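
The waiting rule can be sketched roughly as below. This is an illustration
under assumptions, not what we actually run: SETTLE_SECONDS stands in for
the "up to 5 mins" figure, safe_to_clear is a made-up name, and how you
obtain the node's boot time is left to your own monitoring.

```python
import time

# Assumed settle period: the "up to 5 mins" mentioned above. Tune per site.
SETTLE_SECONDS = 5 * 60


def safe_to_clear(boot_time, now=None, settle=SETTLE_SECONDS):
    """Return True once at least `settle` seconds have passed since boot.

    boot_time and now are Unix timestamps in seconds; now defaults to the
    current time.
    """
    if now is None:
        now = time.time()
    return (now - boot_time) >= settle
```

Only when this returns True (and the filesystem has been restarted) would
the node be put back into service for jobs.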

cheers,
robin

