[torqueusers] Questions about pbs_server --ha

Prakash Velayutham prakash.velayutham at cchmc.org
Mon Apr 13 15:04:50 MDT 2009


Hi Victor,

In my Torque HA setup, I see a PID number in the lock file when only  
one of the HA servers is running. When both are running, there is no  
PID at all in the file. It seems to be working fine for me, so I am  
guessing this is correct.

Prakash
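
For anyone who wants to check what server.lock holds on their own setup,
here is a minimal Python sketch. It assumes the file is either empty or
holds a decimal PID as text, which matches the behaviour described above
but is not confirmed against the Torque source. The helper names are made
up for illustration. The liveness check only consults the local process
table, so it says nothing about a pbs_server running on the other HA host.

```python
import os
from pathlib import Path
from typing import Optional

# Lock file path as given in this thread.
LOCK_FILE = Path("/var/spool/torque/server_priv/server.lock")


def read_lock_pid(lock_file: Path) -> Optional[int]:
    """Return the PID stored in lock_file, or None if the file is
    missing, unreadable, empty, or does not hold a positive integer.
    Assumption: the PID is written as plain decimal text."""
    try:
        text = lock_file.read_text().strip()
    except OSError:
        return None
    try:
        pid = int(text)
    except ValueError:
        return None
    return pid if pid > 0 else None


def pid_is_alive(pid: int) -> bool:
    """True if a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def describe_lock(lock_file: Path) -> str:
    """Summarise the lock file contents in one line."""
    pid = read_lock_pid(lock_file)
    if pid is None:
        return "no PID recorded"
    state = "alive" if pid_is_alive(pid) else "not running"
    return f"PID {pid} ({state} on this host)"


if __name__ == "__main__":
    print(describe_lock(LOCK_FILE))
```

Running it on each pbs_server host in turn would show whether the recorded
PID belongs to a local process there.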

On Apr 13, 2009, at 4:51 PM, Victor Gregorio wrote:

> I think I figured out a solution.  The NFS mount for
> /var/spool/torque/server_priv needs to be 'nolock' instead of the
> default 'lock'.
>
>    * export options: *(rw,sync,no_root_squash)
>    * mount options on both pbs_servers: bg,intr,soft,nolock,rw
>
> Then, I can run two pbs_servers with --ha, pull the plug on the primary
> and the secondary picks up the pbs_server responsibilities.
>
> Question: is the PID inside server.lock that of the primary pbs_server?
> I notice it does not change when the secondary picks up responsibilities.
>
> Is my solution sane?  If so, should the Torque Documentation be updated?
>
> -- 
> Victor Gregorio
> Penguin Computing
>
> On Mon, Apr 13, 2009 at 09:14:14AM -0700, Victor Gregorio wrote:
>> Hello Ken,
>>
>> Thanks for the reply.  I have a third system which exports NFS storage
>> for both pbs_servers' /var/spool/torque/server_priv.  For now, there is
>> no NFS redundancy.
>>
>>    * export options: *(rw,sync,no_root_squash)
>>    * mount options on both pbs_servers: bg,intr,soft,nolock,rw
>>
>> -- 
>> Victor Gregorio
>> Penguin Computing
>>
>> On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
>>> Victor,
>>>
>>> Tell us about your NFS setup. Where does the physical disk reside,
>>> and is it set up to fail over to another system if the primary NFS
>>> server fails?
>>>
>>> Ken Nielson
>>> --------------------
>>> Cluster Resources
>>> knielson at clusterresources.com
>>>
>>>
>>> ----- Original Message -----
>>> From: "Victor Gregorio" <vgregorio at penguincomputing.com>
>>> To: torqueusers at supercluster.org
>>> Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada Mountain
>>> Subject: [torqueusers] Questions about pbs_server --ha
>>>
>>> Hey folks :)
>>>
>>> I've been lurking about for a bit and finally had a question to post.
>>>
>>> So, I am using two systems with pbs_server --ha and a shared NFS mount
>>> for /var/spool/torque/server_priv.  In my testing, I bring down the
>>> primary server by pulling the power plug.  Unfortunately, the secondary
>>> server does not pick up and become the primary pbs_server.
>>>
>>> Is this because /var/spool/torque/server_priv/server.lock is not removed
>>> when the primary server has a critical failure?
>>>
>>> So, I tried removing the server.lock file, but the secondary pbs_server
>>> --ha instance never picks up and becomes primary.  What is the trigger
>>> to activate a passive pbs_server --ha?
>>>
>>> Any advice is appreciated.
>>>
>>> Regards,
>>>
>>> -- 
>>> Victor Gregorio
>>> Penguin Computing
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers

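For reference, the export and mount options quoted in this thread could be
written out roughly as follows. This is only a sketch: the host name
"nfs-server" and the export path /export/torque/server_priv are
placeholders that do not appear in the thread, and only the option strings
and the mount point are taken from Victor's message.

```
# /etc/exports on the NFS server
# (the export path is a placeholder, not taken from the thread)
/export/torque/server_priv  *(rw,sync,no_root_squash)

# /etc/fstab on each pbs_server host
# (the host name and export path are placeholders)
nfs-server:/export/torque/server_priv  /var/spool/torque/server_priv  nfs  bg,intr,soft,nolock,rw  0 0
```

The relevant change is 'nolock' in the client mount options; the other
options are simply the ones Victor reported using.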