[torqueusers] High availability
Ronny T. Lampert
telecaadmin at uni.de
Thu Jun 29 04:29:18 MDT 2006
> Hello, is anyone out there currently running TORQUE/Maui in a high
> availability (>99.95%) situation? I have grep’d through the archives of
> these mailing lists and the only references I see to “high availability”
> are from press releases. I would love to talk to people who do this in
> real situations as I believe we can learn a lot from each other.
Not that I am experienced in that field, but you will run into "problems" on the
torque side, which doesn't support multiple pbs_servers yet (and as such, no
I think Garrick would be the perfect person to respond as they are working
Other than that, I can report torque to be almost perfectly stable once you
have a setup that works. Lots of testing doesn't hurt; some torque versions were
plagued by memory leaks.
What's great: you can upgrade your nodes little by little, because you can mark
them offline (meaning no further jobs are started on them, but existing jobs
are not killed). You also don't have to keep ALL versions the same; mixed
versions usually work as long as they are not too far apart.
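
As a rough illustration of the offlining step described above, here is a minimal
Python sketch around the pbsnodes command (-o marks a node offline, -c clears
that state again). The node names and the helper functions are hypothetical,
and the script only prints the commands unless dry_run is switched off.

```python
import subprocess

# pbsnodes flags: -o marks a node offline, -c clears the offline state.
FLAGS = {"offline": "-o", "clear": "-c"}


def pbsnodes_commands(action, nodes):
    """Return one pbsnodes command (as an argument list) per node."""
    if action not in FLAGS:
        raise ValueError("unknown action: %r" % (action,))
    return [["pbsnodes", FLAGS[action], node] for node in nodes]


def run_all(commands, dry_run=True):
    """Print the commands; execute them only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical node names; substitute the nodes you want to upgrade.
    batch = ["node01", "node02"]
    run_all(pbsnodes_commands("offline", batch))
    # ... wait for running jobs to finish, upgrade the nodes, then:
    run_all(pbsnodes_commands("clear", batch))
```

Running it as-is only prints the four pbsnodes invocations, so it is safe to try
on a machine without torque installed.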
The only downtime I ever have is when I upgrade the pbs_server, and that
usually takes about a minute.
For configuration I can suggest the following:
and keep /usr/local/torque linked to the actual running version.
That way you can easily switch versions when upgrading/downgrading and
always keep the last-known-good version around.
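
The link switch itself could look something like the Python sketch below. It
assumes each version lives in its own directory and that /usr/local/torque is a
symlink; the versioned directory name in the usage comment is made up for
illustration, and switch_version is a hypothetical helper, not part of torque.

```python
import os


def switch_version(link_path, target_dir):
    """Repoint the symlink link_path at target_dir in one rename step."""
    if not os.path.isdir(target_dir):
        raise FileNotFoundError(target_dir)
    # Create the new link under a temporary name next to the old one ...
    tmp = "%s.tmp.%d" % (link_path, os.getpid())
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(target_dir, tmp)
    try:
        # ... then rename it over the old link. On POSIX the rename is
        # atomic, so the link always points at one version or the other.
        os.replace(tmp, link_path)
    except OSError:
        os.unlink(tmp)
        raise


# Usage (hypothetical directory name):
#   switch_version("/usr/local/torque", "/usr/local/torque-NEW")
# Downgrading is the same call with the last-known-good directory.
```

The old version's directory is left untouched, so rolling back is just another
call with the previous path.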