[torqueusers] Questions about pbs_server --ha

Victor Gregorio vgregorio at penguincomputing.com
Tue Apr 14 10:04:39 MDT 2009

Hey Josh,

Thanks for the feedback.  With the snapshot code, is it possible to
run two instances of pbs_server --ha where the server.lock file does not
live in the shared /var/spool/torque/server_priv folder?  I assumed that
the server.lock file was used to determine which pbs_server was primary
when more than one instance was sharing the server_priv folder.  Maybe I
am mistaken?

For now, I am following the advice from you and Ken:  using CentOS's
Heartbeat services to manage the two instances of pbs_server.  I have
had to set up the server_priv folder as a 'nolock' NFS mount in order to
allow the secondary pbs_server to pick up services without 'subsystem
locked' errors.
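
For reference, the workaround Josh suggests below (having Heartbeat run a
script that deletes server.lock when the passive server becomes active)
might look roughly like the following. This is only a sketch under
assumptions: the lock path is pieced together from the server_priv
directory and file name mentioned in this thread, the function name
remove_stale_lock is made up for illustration, and the idea that the
script is called with a "start" argument at takeover is an assumed
convention, not something confirmed here.

```python
#!/usr/bin/env python3
import os
import sys

# Assumed location: server.lock inside the shared server_priv directory
# discussed in this thread. Adjust for the actual installation.
LOCK_PATH = "/var/spool/torque/server_priv/server.lock"


def remove_stale_lock(lock_path=LOCK_PATH):
    """Delete the lock file if present; return True if a file was removed."""
    try:
        os.remove(lock_path)
        return True
    except FileNotFoundError:
        # Nothing to clean up; the previous server exited cleanly.
        return False


def main(argv):
    # Hypothetical convention: act only on "start", i.e. when this node
    # is taking over the pbs_server role.
    action = argv[1] if len(argv) > 1 else ""
    if action == "start":
        removed = remove_stale_lock()
        print("removed stale lock" if removed else "no lock file found")
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

A script like this should only run once the failed node is known to be
down. Removing the lock while the other pbs_server is still alive would
defeat its purpose.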

Victor Gregorio
Penguin Computing

On Mon, Apr 13, 2009 at 12:05:51PM -0600, Josh Butikofer wrote:
> Victor,
> The latest snapshot of TORQUE 2.3.x (not yet an officially released 
> version) allows you to configure where the lock file is stored. You could 
> then tell it to store the file in a non-NFS mounted location so that when 
> the passive becomes active it is not blocked by the server.lock file being 
> present on the NFS share.
> The downside to this is you will be using a snapshot. We hope to release 
> the next version of TORQUE in a few weeks. We are looking for users 
> willing to kick the new TORQUE's tires, however, so if you're interested 
> let us know and we'll cut you a new build.
> Another option, although more of a workaround until the new TORQUE is 
> released, is to have the CentOS heartbeat feature run a script to delete 
> the server.lock when the passive server becomes active.
> Josh Butikofer
> Cluster Resources, Inc.
> #############################
> Victor Gregorio wrote:
>> Thanks Ken,
>> Taking your advice, I configured the two pbs_servers to run an
>> active/passive HA configuration using CentOS's Heartbeat services.  I am
>> no longer running pbs_server with --ha, since only one pbs_server
>> instance will be running at a time.
>> Both primary and secondary pbs_servers still use a shared NFS partition
>> (on a third machine) for /var/spool/torque/server_priv.
>> Unfortunately, there is still a server.lock file left by the primary
>> pbs_server when it starts up.  So, when the primary system critically
>> fails, the secondary system cannot start pbs_server.
>> Thoughts?

