[torqueusers] downing a node via qmgr
Chris Samuel
csamuel at vpac.org
Sun Sep 25 19:42:14 MDT 2005
On Fri, 23 Sep 2005 06:03 am, Garrick Staples wrote:
> On Thu, Sep 22, 2005 at 09:11:09AM +1000, Chris Samuel alleged:
>
> > If you've rebooted the node yourself then you should restart it by hand
> > whereas if the node dies and reboots you're probably going to want to
> > investigate. We do this, and only restart the mom when we've got a
> > better handle on things and think it safe to do so.
>
> Really? That's what you guys do on your cluster? That sounds like a
> major hassle.
I suspect our clusters aren't as large as yours! :-) We've got one with 90
dual P4 Xeon compute nodes, one we run for a University with 60 dual P4 Xeon
compute nodes, one with 36 quad Power 5 compute nodes and one with 16 dual
Opteron compute nodes.
We reckon that not losing jobs that run on broken nodes far outweighs the
hassle of occasionally restarting pbs on a node.
cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia