[torqueusers] High availability

Ronny T. Lampert telecaadmin at uni.de
Thu Jun 29 04:29:18 MDT 2006


> Hello, is anyone out there currently running TORQUE/Maui in a high
> availability (>99.95%) situation? I have grep’d through the archives of
> these mailing lists and the only references I see to “high availability”
> are from press releases. I would love to talk to people who do this in
> real situations as I believe we can learn a lot from each other.

Not that I am experienced in that field, but you will run into problems on
the TORQUE side, which doesn't support multiple pbs_servers yet (and
therefore has no automatic failover).
I think Garrick would be the perfect person to respond, as they are working
on it.

Other than that, I can report that TORQUE is almost perfectly stable once
you have a solid setup. Lots of testing doesn't hurt; some TORQUE versions
were plagued by memory leaks.

What's great: you can upgrade your nodes bit by bit, because you can offline
them (meaning no further jobs are placed on them, but existing jobs are not
killed). You also don't have to keep ALL versions the same; mixed versions
usually work as long as they are not too far apart.
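
As a rough sketch of the offlining step, assuming the stock pbsnodes client
(its -o flag marks nodes offline, -c clears that mark): the helper names and
the node names below are made up for illustration, and the default is a dry
run that only builds the command.

```python
import subprocess


def pbsnodes_command(nodes, offline=True):
    """Build the pbsnodes call that marks nodes offline (-o) or clears the mark (-c)."""
    flag = "-o" if offline else "-c"
    return ["pbsnodes", flag, *nodes]


def set_offline(nodes, offline=True, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = pbsnodes_command(nodes, offline)
    if not dry_run:
        # Needs a working TORQUE client and sufficient privileges on this host.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical node names: take them out of scheduling, upgrade, then clear.
    print(" ".join(set_offline(["node01", "node02"])))
    print(" ".join(set_offline(["node01", "node02"], offline=False)))
```

Running jobs are left alone in between, so the upgrade of a node can wait
until its last job has finished.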

The only downtime I ever have is when I have to upgrade the pbs_server and
that's usually a matter of 1 minute.

For configuration I can suggest the following:


and keep /usr/local/torque symlinked to the version that is currently running.
That way you can easily switch versions when upgrading or downgrading and
always keep the last-known-good version around.
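
One possible way to do the symlink switch described above, sketched under the
assumption that each version lives in its own directory next to the link. The
function name and the directory names (torque-old, torque-new) are
hypothetical, and the demo runs in a temporary directory rather than
/usr/local.

```python
import os
import tempfile


def switch_version(link_path, target_dir):
    """Repoint link_path at target_dir and return the previous target (or None)."""
    previous = os.readlink(link_path) if os.path.islink(link_path) else None
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted run
    os.symlink(target_dir, tmp_link)
    # Renaming the new link over the old one swaps it in a single step,
    # so the path never points at nothing.
    os.replace(tmp_link, link_path)
    return previous


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    old = os.path.join(root, "torque-old")  # hypothetical version directories
    new = os.path.join(root, "torque-new")
    os.mkdir(old)
    os.mkdir(new)
    link = os.path.join(root, "torque")
    switch_version(link, old)
    last_known_good = switch_version(link, new)  # upgrade
    switch_version(link, last_known_good)        # downgrade again if needed
    print(os.readlink(link) == old)
```

The returned previous target is what you would hold on to as the
last-known-good version.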

