[torqueusers] High availability

Garrick Staples garrick at clusterresources.com
Thu Jun 29 11:03:32 MDT 2006

On Thu, Jun 29, 2006 at 12:29:18PM +0200, Ronny T. Lampert alleged:
> Hi,
> > Hello, is anyone out there currently running TORQUE/Maui in a high
> > availability (>99.95%) situation? I have grep'd through the archives of
> > these mailing lists and the only references I see to "high availability"
> > are from press releases. I would love to talk to people who do this in
> > real situations as I believe we can learn a lot from each other.
> not that I am experienced in that field, but you will get "problems" on the
> torque side which doesn't support multiple pbs_servers yet (and as such, no
> automatic failovers).
> I think Garrick would be the perfect person to respond as they are working
> on it.

TORQUE does support multiple servers, but the servers don't talk to each
other.  This essentially means you can have a node in multiple clusters
at once.  

Moab can even talk to both servers and understand that the nodes are the
same and schedule appropriately.

But there is no failover, fault tolerance, or HA support between
servers.  I'm not actively working on this right now, but it is on the
TODO list.

> Other than that I can report torque to be almost perfectly stable once you
> get a stable setup. Lots of testing doesn't hurt; some torque versions were
> plagued by mem-leaks.
> What's great: You can upgrade your nodes little by little because you can
> offline them (meaning no further jobs on them, but this doesn't kill existing jobs).
> You also don't have to keep ALL versions the same; mixed versions usually
> work as long as there is not too much difference.

We've done a pretty good job so far with compatibility.  I don't think
there has been a single change on the wire since the first few releases.
And now with 2.1.x, MOMs upgrade themselves!
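
For illustration, the offline-then-upgrade routine described in the
quoted text above could be sketched roughly as follows. This is only a
sketch: the node names are made up, the drain/upgrade step is left as a
comment, and the commands are merely printed unless RUN=1 is set.
pbsnodes -o marks a node offline and pbsnodes -c clears that flag.

```shell
#!/bin/sh
# Rolling-upgrade sketch. Node names are hypothetical examples.
# Commands are only printed (dry run) unless RUN=1 is set in the environment.
set -eu

run() {
    # Execute the command when RUN=1, otherwise just print it.
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

rolling_upgrade() {
    for node in "$@"; do
        run pbsnodes -o "$node"   # mark offline: no new jobs, running jobs continue
        # ... wait for running jobs to finish, then upgrade pbs_mom on the node ...
        run pbsnodes -c "$node"   # clear the offline flag so the node takes jobs again
    done
}

rolling_upgrade node01 node02
```

Doing one node (or a small batch) at a time keeps the rest of the
cluster available for scheduling while the upgrade is in progress.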

> The only downtime I ever have is when I have to upgrade the pbs_server and
> that's usually a matter of 1 minute.
> For configuration I can suggest the following:
> --prefix=/usr/local/torque-<VERSION>
> --with-server-home=/usr/local/torque-home
> and keep /usr/local/torque linked to the actual running version.
> That way you can easily switch versions when upgrading/downgrading and
> always keep the last-known-good version around.

Tsk, tsk, server home should be in /var :)
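
As a rough sketch of the versioned-prefix layout suggested in the quoted
text, something like the following could work. The ROOT directory and
the version numbers are hypothetical stand-ins (a temporary directory is
used in place of /usr/local so it runs unprivileged), and empty
directories stand in for real installs.

```shell
#!/bin/sh
# Sketch of a versioned-prefix layout with a "current" symlink.
# ROOT and the version numbers below are hypothetical examples.
set -eu

ROOT="${ROOT:-$(mktemp -d)}"   # stand-in for /usr/local

# Each build would be configured along the lines of:
#   ./configure --prefix=$ROOT/torque-<VERSION> --with-server-home=<server home>
# Here empty directories stand in for two installed versions.
mkdir -p "$ROOT/torque-2.1.0/sbin" "$ROOT/torque-2.1.1/sbin"

switch_version() {
    # Point $ROOT/torque at the requested version.
    # -n replaces the existing link instead of creating a link inside its target.
    ln -sfn "torque-$1" "$ROOT/torque"
}

switch_version 2.1.0   # currently running version
switch_version 2.1.1   # upgrade
switch_version 2.1.0   # roll back to the last-known-good version
```

Because the server home lives outside the versioned prefix, switching
the link changes only the binaries, not the server's state.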
