[torqueusers] Questions about pbs_server --ha

Victor Gregorio vgregorio at penguincomputing.com
Tue Apr 14 09:54:55 MDT 2009


Update: I had mixed results with the --ha solution I outlined below.
When I pulled the plug on the primary, the secondary's qstat listing
sometimes differed from what the primary showed at the time of
failure.

So, what I am doing now is taking Ken's advice: running pbs_server
without --ha, and having the OS's heartbeat services manage which of
the two systems (primary or secondary) has pbs_server running.
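
For readers who want the idea in concrete form, here is a rough Python
sketch of the decision a heartbeat-managed secondary makes.  It is only
an illustration: the class name, the threshold of three missed
heartbeats, and the failback policy are all assumptions, not part of
Torque or of any heartbeat package, which would handle this through its
own resource scripts.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical helper: decides when the secondary should run pbs_server."""

    max_missed: int = 3   # assumed: consecutive missed heartbeats before takeover
    missed: int = 0       # consecutive heartbeats missed so far
    active: bool = False  # True while this node should be running pbs_server

    def observe(self, primary_alive: bool) -> bool:
        """Record one heartbeat result and return whether this node should be active."""
        if primary_alive:
            # Primary answered: reset the counter and stand down
            # (assumed failback policy; some sites prefer to stay on the secondary).
            self.missed = 0
            self.active = False
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = True
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor(max_missed=3)
    for alive in [True, False, False, False, False, True]:
        print(alive, monitor.observe(alive))
```

Requiring several missed heartbeats in a row keeps a single dropped
packet from starting a second pbs_server against the same server_priv
directory.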

Please note that I still had to mount the shared NFS partition for
/var/spool/torque/server_priv as 'nolock'.  Otherwise, the secondary
pbs_server had trouble starting (subsystem locked error) when the
primary went down.
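
Since a missing 'nolock' is easy to overlook, a small check before
starting pbs_server may help.  The Python sketch below parses text in
fstab or /proc/mounts layout (device, mount point, type, options) and
reports the options for a mount point.  The function name and the
sample line (host "nfshost", export path "/export/server_priv") are
made up for illustration.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of options for mount_point, or an empty set if not listed."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and entries with too few fields.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        if fields[1] == mount_point:
            return set(fields[3].split(","))
    return set()


# Hypothetical sample entry; the host and export path are placeholders.
SAMPLE = (
    "# device mountpoint type options dump pass\n"
    "nfshost:/export/server_priv /var/spool/torque/server_priv nfs "
    "bg,intr,soft,nolock,rw 0 0\n"
)

if __name__ == "__main__":
    opts = mount_options(SAMPLE, "/var/spool/torque/server_priv")
    print("nolock" in opts)
```

On a live system the same function could be fed the contents of
/proc/mounts or /etc/fstab instead of the sample string.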

-- 
Victor Gregorio
Penguin Computing

On Mon, Apr 13, 2009 at 01:51:44PM -0700, Victor Gregorio wrote:
> I think I figured out a solution.  The NFS mount for
> /var/spool/torque/server_priv needs to be 'nolock' instead of the
> default 'lock'.
> 
>     * export options: *(rw,sync,no_root_squash)
>     * mount options on both pbs_servers: bg,intr,soft,nolock,rw
> 
> Then I can run two pbs_servers with --ha, pull the plug on the primary,
> and the secondary picks up the pbs_server responsibilities.
> 
> Question: is the PID inside server.lock that of the primary pbs_server?
> I notice it does not change when the secondary picks up
> responsibilities.
> 
> Is my solution sane?  If so, should the Torque Documentation be updated?
> 
> -- 
> Victor Gregorio
> Penguin Computing
> 
> On Mon, Apr 13, 2009 at 09:14:14AM -0700, Victor Gregorio wrote:
> > Hello Ken,
> > 
> > Thanks for the reply.  I have a third system which exports NFS storage
> > for both pbs_servers' /var/spool/torque/server_priv.  For now, there is
> > no NFS redundancy.
> > 
> >     * export options: *(rw,sync,no_root_squash)
> >     * mount options on both pbs_servers: bg,intr,soft,nolock,rw
> > 
> > -- 
> > Victor Gregorio
> > Penguin Computing
> > 
> > On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
> > > Victor,
> > > 
> > > Tell us about your NFS setup. Where does the physical disk reside, and is it set up to fail over to another system if the primary NFS server fails?
> > > 
> > > Ken Nielson
> > > --------------------
> > > Cluster Resources
> > > knielson at clusterresources.com
> > > 
> > > 
> > > ----- Original Message -----
> > > From: "Victor Gregorio" <vgregorio at penguincomputing.com>
> > > To: torqueusers at supercluster.org
> > > Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada Mountain
> > > Subject: [torqueusers] Questions about pbs_server --ha
> > > 
> > > Hey folks :)
> > > 
> > > I've been lurking about for a bit and finally had a question to post.
> > > 
> > > So, I am using two systems with pbs_server --ha and a shared NFS mount
> > > for /var/spool/torque/server_priv.  In my testing, I bring down the
> > > primary server by pulling the power plug.  Unfortunately, the secondary
> > > server does not pick up and become the primary pbs_server.
> > > 
> > > Is this because /var/spool/torque/server_priv/server.lock is not removed
> > > when the primary server has a critical failure?
> > > 
> > > So, I tried removing the server.lock file, but the secondary pbs_server
> > > --ha instance still never picked up and became primary.  What is the
> > > trigger that activates a passive pbs_server --ha?
> > > 
> > > Any advice is appreciated.
> > > 
> > > Regards,
> > > 
> > > -- 
> > > Victor Gregorio
> > > Penguin Computing
> > > 
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torqueusers

