[torqueusers] Questions about pbs_server --ha

Victor Gregorio vgregorio at penguincomputing.com
Tue Apr 14 10:07:54 MDT 2009


Hello Prakash,

Very interesting.  Thanks for the reply.  Have you attempted to pull the
plug on the primary to test the failover?  If your failover is working,
I am curious what your settings are.  In particular:

* Which version of Torque are you running?
* What NFS export options do you use for the server_priv folder?
* What NFS mount options do you use for the server_priv folder?
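
If it helps, this is roughly how I gather that information on my end.
The commands below are just what I use here, so adjust for your
environment:

    # Torque version, as reported by the server
    qmgr -c 'list server' | grep pbs_version

    # NFS mount options actually in effect for server_priv
    grep server_priv /proc/mounts

    # Export options on the NFS server
    exportfs -v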

Regards,

-- 
Victor Gregorio
Penguin Computing

On Mon, Apr 13, 2009 at 05:04:50PM -0400, Prakash Velayutham wrote:
> Hi Victor,
>
> In my Torque HA setup, I see a PID number in the lock file when only one 
> of the HA servers is running. When both are running, there is no PID at 
> all in the file. It seems to be working fine for me, so I am guessing this 
> is correct.
>
> Prakash
>
> On Apr 13, 2009, at 4:51 PM, Victor Gregorio wrote:
>
>> I think I figured out a solution.  The NFS mount for
>> /var/spool/torque/server_priv needs to use 'nolock' instead of the
>> default 'lock'.
>>
>>    * export options: *(rw,sync,no_root_squash)
>>    * mount options on both pbs_servers: bg,intr,soft,nolock,rw
>>
>> Then, I can run two pbs_servers with --ha, pull the plug on the primary
>> and the secondary picks up the pbs_server responsibilities.
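>>
>> Roughly, the relevant entries look like this in my test setup (the
>> 'nfshost' hostname and export path below are just placeholders for my
>> environment):
>>
>>    # /etc/exports on the NFS server
>>    /export/torque/server_priv  *(rw,sync,no_root_squash)
>>
>>    # /etc/fstab on both pbs_server hosts
>>    nfshost:/export/torque/server_priv  /var/spool/torque/server_priv  nfs  bg,intr,soft,nolock,rw  0 0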
>>
>> Question: is the PID inside server.lock that of the primary pbs_server?
>> I notice it does not change when the secondary picks up
>> responsibilities.
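>>
>> (For what it's worth, I compared the lock file against the running
>> daemon like this; adjust the path for your install:
>>
>>    cat /var/spool/torque/server_priv/server.lock
>>    pgrep pbs_server
>> )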
>>
>> Is my solution sane?  If so, should the Torque Documentation be  
>> updated?
>>
>> -- 
>> Victor Gregorio
>> Penguin Computing
>>
>> On Mon, Apr 13, 2009 at 09:14:14AM -0700, Victor Gregorio wrote:
>>> Hello Ken,
>>>
>>> Thanks for the reply.  I have a third system which exports NFS  
>>> storage
>>> for both pbs_servers' /var/spool/torque/server_priv.  For now, there 
>>> is
>>> no NFS redundancy.
>>>
>>>    * export options: *(rw,sync,no_root_squash)
>>>    * mount options on both pbs_servers: bg,intr,soft,nolock,rw
>>>
>>> -- 
>>> Victor Gregorio
>>> Penguin Computing
>>>
>>> On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
>>>> Victor,
>>>>
>>>> Tell us about your NFS setup. Where does the physical disk reside,
>>>> and is it set up to fail over to another system if the primary NFS
>>>> fails?
>>>>
>>>> Ken Nielson
>>>> --------------------
>>>> Cluster Resources
>>>> knielson at clusterresources.com
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Victor Gregorio" <vgregorio at penguincomputing.com>
>>>> To: torqueusers at supercluster.org
>>>> Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada  
>>>> Mountain
>>>> Subject: [torqueusers] Questions about pbs_server --ha
>>>>
>>>> Hey folks :)
>>>>
>>>> I've been lurking about for a bit and finally had a question to  
>>>> post.
>>>>
>>>> So, I am using two systems with pbs_server --ha and a shared NFS  
>>>> mount
>>>> for /var/spool/torque/server_priv.  In my testing, I bring down the
>>>> primary server by pulling the power plug.  Unfortunately, the  
>>>> secondary
>>>> server does not pick up and become the primary pbs_server.
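>>>>
>>>> For reference, I start the daemon the same way on both nodes; the
>>>> exact init-script integration may differ on your distro:
>>>>
>>>>    pbs_server --ha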
>>>>
>>>> Is this because /var/spool/torque/server_priv/server.lock is not  
>>>> removed
>>>> when the primary server has a critical failure?
>>>>
>>>> So, I tried removing the server.lock file, but the secondary  
>>>> pbs_server
>>>> --ha instance never picks up and becomes primary.  What is the  
>>>> trigger
>>>> to activate a passive pbs_server --ha?
>>>>
>>>> Any advice is appreciated.
>>>>
>>>> Regards,
>>>>
>>>> -- 
>>>> Victor Gregorio
>>>> Penguin Computing
>>>>
>

