[torqueusers] Questions about pbs_server --ha
vgregorio at penguincomputing.com
Mon Apr 13 14:51:44 MDT 2009
I think I figured out a solution. The NFS mount for
/var/spool/torque/server_priv needs to use the 'nolock' option:
* export options: *(rw,sync,no_root_squash)
* mount options on both pbs_servers: bg,intr,soft,nolock,rw
Then, I can run two pbs_servers with --ha, pull the plug on the primary
and the secondary picks up the pbs_server responsibilities.
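For anyone who wants to confirm that 'nolock' is actually in effect on each
pbs_server host, here is a minimal Python sketch. It is not part of Torque;
the function names are made up for illustration, and it assumes the kernel
reports the option in /proc/mounts-style text (device, mount point, fs type,
options, ...).

```python
def nfs_mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not mounted.

    mounts_text is expected to look like /proc/mounts:
    device, mount point, fs type, comma-separated options, dump, pass.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last match: a later mount on the same path hides earlier ones.
            options = fields[3].split(",")
    return options


def has_nolock(mounts_text, mount_point="/var/spool/torque/server_priv"):
    """True if mount_point is mounted and 'nolock' is among its options."""
    options = nfs_mount_options(mounts_text, mount_point)
    return options is not None and "nolock" in options


# Hypothetical usage on a pbs_server host:
#     with open("/proc/mounts") as f:
#         print(has_nolock(f.read()))
```

Running it on both pbs_server hosts should report the same result if the
mounts are configured identically.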
Question: is the PID inside server.lock that of the primary pbs_server?
I notice it does not change when the secondary takes over.
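One way to investigate this is to read the PID from server.lock on each host
and check whether a process with that PID exists there. The Python sketch
below is only an illustration: the helper names are hypothetical, and it
assumes the file begins with a decimal PID. A PID is only meaningful on the
host that wrote it, so this does not by itself show which server is primary.

```python
import os


def read_lock_pid(lock_path):
    """Return the first whitespace-separated token of the lock file as an int."""
    with open(lock_path) as f:
        return int(f.read().split()[0])


def pid_is_running(pid):
    """True if a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Hypothetical usage on each pbs_server host:
#     pid = read_lock_pid("/var/spool/torque/server_priv/server.lock")
#     print(pid, pid_is_running(pid))
```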
Is my solution sane? If so, should the Torque Documentation be updated?
On Mon, Apr 13, 2009 at 09:14:14AM -0700, Victor Gregorio wrote:
> Hello Ken,
> Thanks for the reply. I have a third system which exports NFS storage
> for both pbs_servers' /var/spool/torque/server_priv. For now, there is
> no NFS redundancy.
> * export options: *(rw,sync,no_root_squash)
> * mount options on both pbs_servers: bg,intr,soft,nolock,rw
> Victor Gregorio
> Penguin Computing
> On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
> > Victor,
> > Tell us about your NFS setup. Where does the physical disk reside, and is it set up to fail over to another system if the primary NFS fails?
> > Ken Nielson
> > --------------------
> > Cluster Resources
> > knielson at clusterresources.com
> > ----- Original Message -----
> > From: "Victor Gregorio" <vgregorio at penguincomputing.com>
> > To: torqueusers at supercluster.org
> > Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada Mountain
> > Subject: [torqueusers] Questions about pbs_server --ha
> > Hey folks :)
> > I've been lurking about for a bit and finally had a question to post.
> > So, I am using two systems with pbs_server --ha and a shared NFS mount
> > for /var/spool/torque/server_priv. In my testing, I bring down the
> > primary server by pulling the power plug. Unfortunately, the secondary
> > server does not pick up and become the primary pbs_server.
> > Is this because /var/spool/torque/server_priv/server.lock is not removed
> > when the primary server has a critical failure?
> > So, I tried removing the server.lock file, but the secondary pbs_server
> > --ha instance never picks up and becomes primary. What is the trigger
> > to activate a passive pbs_server --ha?
> > Any advice is appreciated.
> > Regards,
> > --
> > Victor Gregorio
> > Penguin Computing
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers