[torqueusers] Questions about pbs_server --ha
Victor Gregorio
vgregorio at penguincomputing.com
Tue Apr 14 10:04:39 MDT 2009
Hey Josh,
Thanks for the feedback. With the snapshot code, is it possible to
run two instances of pbs_server --ha where the server.lock file does not
live in the shared /var/spool/torque/server_priv folder? I assumed that
the server.lock file was used to determine which pbs_server was primary
when more than one instance was sharing the server_priv folder. Maybe I
am mistaken?
For now, I am following the advice from you and Ken: using CentOS's
Heartbeat services to manage the two instances of pbs_server. I have
had to set up the server_priv folder as a 'nolock' NFS mount in order to
allow the secondary pbs_server to pick up services without 'subsystem locked'
errors.
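
For reference, the lock-removal workaround Josh suggests in the quoted message
could be sketched roughly as below. This is only a minimal sketch, not
something taken from TORQUE or Heartbeat: the function name remove_stale_lock
is hypothetical, and the default path assumes server.lock sits in the shared
/var/spool/torque/server_priv folder as described in this thread. It is only
safe to run once the failed primary is known to be down.

```python
import os

# Path as described in this thread; adjust if the spool directory differs.
DEFAULT_LOCK = "/var/spool/torque/server_priv/server.lock"


def remove_stale_lock(lock_path=DEFAULT_LOCK):
    """Delete lock_path if it exists.

    Returns True if a lock file was removed, False if none was present.
    Hypothetical helper: call it only after the old primary is confirmed dead.
    """
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        # Nothing to clean up; the secondary can start pbs_server directly.
        return False
    return True
```

A Heartbeat resource script would call remove_stale_lock() just before
starting pbs_server on the node that is taking over.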
--
Victor Gregorio
Penguin Computing
On Mon, Apr 13, 2009 at 12:05:51PM -0600, Josh Butikofer wrote:
> Victor,
>
> The latest snapshot of TORQUE 2.3.x (not yet an officially released
> version) allows you to configure where the lock file is stored. You could
> then tell it to store the file in a non-NFS mounted location so that when
> the passive becomes active it is not blocked by the server.lock file being
> present on the NFS share.
>
> The downside to this is you will be using a snapshot. We hope to release
> the next version of TORQUE in a few weeks. We are looking for users
> willing to kick the new TORQUE's tires, however, so if you're interested
> let us know and we'll cut you a new build.
>
> Another option, although more of a workaround until the new TORQUE is
> released, is to have the CentOS heartbeat feature run a script to delete
> the server.lock when the passive server becomes active.
>
> Josh Butikofer
> Cluster Resources, Inc.
> #############################
>
>
> Victor Gregorio wrote:
>> Thanks Ken,
>>
>> Taking your advice, I configured the two pbs_servers to run an
>> active/passive HA configuration using CentOS's Heartbeat services. I am
>> no longer running pbs_server with --ha, since only one pbs_server
>> instance will be running at a time.
>>
>> Both primary and secondary pbs_servers still use a shared NFS partition
>> (on a third machine) for /var/spool/torque/server_priv.
>>
>> Unfortunately, there is still a server.lock file left by the primary
>> pbs_server when it starts up. So, when the primary system critically
>> fails, the secondary system cannot start pbs_server.
>>
>> Thoughts?
>>