[torqueusers] Questions about pbs_server --ha

Ken Nielson knielson at clusterresources.com
Mon Apr 13 10:59:01 MDT 2009


Victor,

Your observation about torque's --ha option is correct. If the controlling pbs_server simply goes away, the lock file remains in place. You can delete the lock file on the NFS share; the redundant pbs_server will then create a new lock file and start running as the active server.
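
As a rough illustration of that manual step, here is a minimal Python sketch, not anything shipped with torque. The lock path is the one mentioned in this thread. The helper names, and the idea of treating a successful TCP connection as proof that the primary is alive, are assumptions made for the example.

```python
import os
import socket

# Path taken from this thread; adjust it for your installation.
LOCK_PATH = "/var/spool/torque/server_priv/server.lock"


def host_is_reachable(host, port, timeout=3.0):
    """Hypothetical liveness check: True if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable names.
        return False


def remove_stale_lock(lock_path, primary_alive):
    """Delete lock_path only when the primary is believed to be down.

    Returns True if a file was removed, False otherwise.
    """
    if primary_alive:
        # Never pull the lock out from under a running server.
        return False
    try:
        os.remove(lock_path)
        return True
    except FileNotFoundError:
        # Nothing to clean up; the lock is already gone.
        return False
```

It could be called as remove_stale_lock(LOCK_PATH, host_is_reachable("pbs-primary", 15001)), where "pbs-primary" is a made-up host name and 15001 is assumed to be the port your pbs_server listens on. A failed connection can also mean a network problem rather than a dead server, so treat the check as a hint and confirm by hand before deleting anything.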

We understand that this is not the ideal way to make torque highly available. We should add this to our list of things to do to improve torque.

If possible, you might try a high-availability OS, that is, redundant systems: if one machine goes down, another machine can load and run the image of the failed system. I know this option requires far more resources, but I wanted to suggest it in case it is something you can do.

Regards

Ken Nielson
Cluster Resources
knielson at clusterresources.com

----- Original Message -----
From: "Victor Gregorio" <vgregorio at penguincomputing.com>
To: "Ken Nielson" <knielson at clusterresources.com>
Cc: torqueusers at supercluster.org
Sent: Monday, April 13, 2009 10:14:14 AM GMT -07:00 US/Canada Mountain
Subject: Re: [torqueusers] Questions about pbs_server --ha

Hello Ken,

Thanks for the reply.  I have a third system which exports NFS storage
for both pbs_servers' /var/spool/torque/server_priv.  For now, there is
no NFS redundancy.

    * export options: *(rw,sync,no_root_squash)
    * mount options on both pbs_servers: bg,intr,soft,rw

-- 
Victor Gregorio
Penguin Computing

On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
> Victor,
> 
> Tell us about your NFS setup. Where does the physical disk reside, and is it set up to fail over to another system if the primary NFS server fails?
> 
> Ken Nielson
> --------------------
> Cluster Resources
> knielson at clusterresources.com
> 
> 
> ----- Original Message -----
> From: "Victor Gregorio" <vgregorio at penguincomputing.com>
> To: torqueusers at supercluster.org
> Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada Mountain
> Subject: [torqueusers] Questions about pbs_server --ha
> 
> Hey folks :)
> 
> I've been lurking about for a bit and finally had a question to post.
> 
> So, I am using two systems with pbs_server --ha and a shared NFS mount
> for /var/spool/torque/server_priv.  In my testing, I bring down the
> primary server by pulling the power plug.  Unfortunately, the secondary
> server does not pick up and become the primary pbs_server.
> 
> Is this because /var/spool/torque/server_priv/server.lock is not removed
> when the primary server has a critical failure?
> 
> So, I tried removing the server.lock file, but the secondary pbs_server
> --ha instance never picks up and becomes primary.  What is the trigger
> to activate a passive pbs_server --ha?
> 
> Any advice is appreciated.
> 
> Regards,
> 
> -- 
> Victor Gregorio
> Penguin Computing
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

