[torqueusers] Questions about pbs_server --ha

Prakash Velayutham prakash.velayutham at cchmc.org
Tue Apr 14 06:36:22 MDT 2009


Hi Stewart,

It has been a while, but I do remember testing all of these in my
setup (same as yours, VMware-based OpenSUSE Torque/Moab service
nodes). If you want, I can kick off some more tests just to confirm.

And I am at Torque 2.3.6.

Regards,
Prakash

On Apr 13, 2009, at 7:56 PM, <Stewart.Samuels at sanofi-aventis.com> wrote:

> Prakash,
>
> Do you now have the system working with --ha?  Like Victor, I have
> never been able to get --ha working as advertised.  In fact, I have
> found that things tend to work okay if jobs were not running when the
> primary server failed (I used a VMware cluster set up similar to what
> Victor described and shut down the primary VM).  Using virtual
> machines is a good way to test this code because you can run tests
> varying the length of the jobs that are executing and then kill the
> VM to see if the secondary picks up the running jobs.  In my
> experience, success seems to be about fifty-fifty.
>
> I ask you now whether you have it running because it has been several
> snapshots since I last tested HA.  I was just in the process of
> upgrading my VMware setup as well as TORQUE and Maui to test again
> and see whether progress has been made in this department.
>
> I also tried Heartbeat, but the same fundamental problem still
> exists.  If jobs are running when the primary server fails, they stay
> in a running state (from the view of the secondary server) after the
> secondary server picks up.  You can submit new jobs and they will
> complete (assuming the secondary doesn't fail), but the jobs that
> were originally submitted through the primary remain in that state
> until the original primary server is brought back online.  Even then,
> it is not guaranteed that the results of the jobs caught in the
> failover are uncorrupted.
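>
> To illustrate, after a failover the secondary shows something like
> the following (the job ID, name, and user here are hypothetical):
>
>     $ qstat
>     Job id           Name       User      Time Use S Queue
>     ---------------- ---------- --------- -------- - -----
>     42.primary       test.sh    stewart   00:00:00 R batch
>
> and that job never leaves the R state until the original primary
> comes back.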
>
> 	Stewart
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Prakash
> Velayutham
> Sent: Monday, April 13, 2009 5:05 PM
> To: Victor Gregorio
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Questions about pbs_server --ha
>
> Hi Victor,
>
> In my Torque HA setup, I see a PID number in the lock file when only  
> one
> of the HA servers is running. When both are running, there is no PID  
> at
> all in the file. It seems to be working fine for me, so I am guessing
> this is correct.
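>
> For example (using the default spool location), the lock file can be
> inspected on either server:
>
>     # prints the PID of the active pbs_server; in my setup it is
>     # empty while both --ha instances are running
>     cat /var/spool/torque/server_priv/server.lock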
>
> Prakash
>
> On Apr 13, 2009, at 4:51 PM, Victor Gregorio wrote:
>
>> I think I figured out a solution.  The NFS mount for
>> /var/spool/torque/server_priv needs to be 'nolock' instead of the
>> default 'lock'.
>>
>>   * export options: *(rw,sync,no_root_squash)
>>   * mount options on both pbs_servers: bg,intr,soft,nolock,rw
>>
>> Then, I can run two pbs_servers with --ha, pull the plug on the
>> primary, and the secondary picks up the pbs_server responsibilities.
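>>
>> For reference, those mount options correspond to an /etc/fstab entry
>> on each pbs_server along these lines (the NFS host name "nfshost" is
>> hypothetical):
>>
>>     nfshost:/var/spool/torque/server_priv  /var/spool/torque/server_priv  nfs  bg,intr,soft,nolock,rw  0 0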
>>
>> Question: is the PID inside server.lock that of the primary
>> pbs_server?
>> I notice it does not change when the secondary picks up
>> responsibilities.
>>
>> Is my solution sane?  If so, should the Torque documentation be
>> updated?
>>
>> --
>> Victor Gregorio
>> Penguin Computing
>>
>> On Mon, Apr 13, 2009 at 09:14:14AM -0700, Victor Gregorio wrote:
>>> Hello Ken,
>>>
>>> Thanks for the reply.  I have a third system which exports NFS
>>> storage for both pbs_servers' /var/spool/torque/server_priv.  For
>>> now, there is no NFS redundancy.
>>>
>>>   * export options: *(rw,sync,no_root_squash)
>>>   * mount options on both pbs_servers: bg,intr,soft,nolock,rw
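>>>
>>> Concretely, those export options correspond to a line in /etc/exports
>>> on the third system like this (path per the default TORQUE layout):
>>>
>>>     /var/spool/torque/server_priv  *(rw,sync,no_root_squash)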
>>>
>>> --
>>> Victor Gregorio
>>> Penguin Computing
>>>
>>> On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
>>>> Victor,
>>>>
>>>> Tell us about your NFS setup. Where does the physical disk reside,
>>>> and is it set up to fail over to another system if the primary NFS
>>>> server fails?
>>>>
>>>> Ken Nielson
>>>> --------------------
>>>> Cluster Resources
>>>> knielson at clusterresources.com
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Victor Gregorio" <vgregorio at penguincomputing.com>
>>>> To: torqueusers at supercluster.org
>>>> Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada
>>>> Mountain
>>>> Subject: [torqueusers] Questions about pbs_server --ha
>>>>
>>>> Hey folks :)
>>>>
>>>> I've been lurking about for a bit and finally had a question to
>>>> post.
>>>>
>>>> So, I am using two systems with pbs_server --ha and a shared NFS
>>>> mount for /var/spool/torque/server_priv.  In my testing, I bring
>>>> down the primary server by pulling the power plug.  Unfortunately,
>>>> the secondary server does not pick up and become the primary
>>>> pbs_server.
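>>>>
>>>> Both instances are started the same way (a sketch of my setup,
>>>> using the default paths):
>>>>
>>>>     # on each of the two server nodes, with server_priv NFS-mounted
>>>>     pbs_server --ha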
>>>>
>>>> Is this because /var/spool/torque/server_priv/server.lock is not
>>>> removed when the primary server has a critical failure?
>>>>
>>>> So, I tried removing the server.lock file, but the secondary
>>>> pbs_server --ha instance never picks up and becomes primary.  What
>>>> is the trigger to activate a passive pbs_server --ha?
>>>>
>>>> Any advice is appreciated.
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> Victor Gregorio
>>>> Penguin Computing
>>>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


