[torqueusers] Questions about pbs_server --ha

Stewart.Samuels at sanofi-aventis.com
Mon Apr 13 17:56:27 MDT 2009


Prakash,

Do you now have the system working with --ha?  Like Victor, I have never
been able to get --ha working as advertised.  In fact, I have found
that things tend to work okay only if no jobs were running when the
primary server failed (I used a VMware cluster set up similar to what
Victor described and shut down the primary VM).  Virtual machines are a
good way to test this code because you can run tests varying the length
of the executing jobs and then kill the VM to see whether the secondary
picks up the running jobs.  In my experience, success seems to be about
fifty-fifty.
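
For example, a minimal test along these lines (the sleep lengths and
walltime here are arbitrary):

    # on the primary, submit jobs of varying length
    for t in 60 300 900; do
        echo "sleep $t" | qsub -l walltime=00:30:00
    done
    # power off the primary VM, then watch from the secondary
    qstat -a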

I ask whether you have it running now because it has been several
snapshots since I last tested HA.  I was just in the process of
upgrading my VMware setup as well as TORQUE and MAUI to test again and
see whether progress has been made in this department.

I also tried Heartbeat, but the same fundamental problem still exists.
If jobs are running when the primary server fails, they stay in a
running state (from the view of the secondary server) after the
secondary server takes over.  You can submit new jobs and they will
complete (assuming the secondary doesn't fail), but the jobs that were
originally submitted to the primary remain in that state until the
original primary server is brought back online.  Even then, there is no
guarantee that the results of the jobs caught in the failover are not
corrupted.
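
The stuck state is visible from the secondary with something like:

    # <jobid> is one of the jobs submitted before the failover
    qstat -f <jobid> | grep job_state

which keeps reporting job_state = R for the jobs caught in the
failover.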

	Stewart 

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Prakash
Velayutham
Sent: Monday, April 13, 2009 5:05 PM
To: Victor Gregorio
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] Questions about pbs_server --ha

Hi Victor,

In my Torque HA setup, I see a PID number in the lock file when only one
of the HA servers is running. When both are running, there is no PID at
all in the file. It seems to be working fine for me, so I am guessing
this is correct.
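
For anyone wanting to verify this on their own setup, the lock file and
the running daemons are easy to compare:

    cat /var/spool/torque/server_priv/server.lock
    pgrep pbs_server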

Prakash

On Apr 13, 2009, at 4:51 PM, Victor Gregorio wrote:

> I think I figured out a solution.  The NFS mount for 
> /var/spool/torque/server_priv needs to be 'nolock' instead of the 
> default 'lock'.
>
>    * export options: *(rw,sync,no_root_squash)
>    * mount options on both pbs_servers: bg,intr,soft,nolock,rw
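>
> For reference, assuming the NFS host is named nfs-server and exports
> the same path (both names are illustrative), that works out to:
>
>     # /etc/exports on the NFS server
>     /var/spool/torque/server_priv *(rw,sync,no_root_squash)
>
>     # /etc/fstab on both pbs_servers (nfs-server is a placeholder)
>     nfs-server:/var/spool/torque/server_priv  /var/spool/torque/server_priv  nfs  bg,intr,soft,nolock,rw  0 0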
>
> Then, I can run two pbs_servers with --ha, pull the plug on the 
> primary and the secondary picks up the pbs_server responsibilities.
>
> Question: is the PID inside server.lock that of the primary 
> pbs_server?
> I notice it does not change when the secondary picks up 
> responsibilities.
>
> Is my solution sane?  If so, should the Torque Documentation be 
> updated?
>
> --
> Victor Gregorio
> Penguin Computing
>
> On Mon, Apr 13, 2009 at 09:14:14AM -0700, Victor Gregorio wrote:
>> Hello Ken,
>>
>> Thanks for the reply.  I have a third system which exports NFS 
>> storage for both pbs_servers' /var/spool/torque/server_priv.  For 
>> now, there is no NFS redundancy.
>>
>>    * export options: *(rw,sync,no_root_squash)
>>    * mount options on both pbs_servers: bg,intr,soft,nolock,rw
>>
>> --
>> Victor Gregorio
>> Penguin Computing
>>
>> On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
>>> Victor,
>>>
>>> Tell us about your NFS setup. Where does the physical disk reside 
>>> and is it set up to fail over to another system if the primary NFS
>>> fails?
>>>
>>> Ken Nielson
>>> --------------------
>>> Cluster Resources
>>> knielson at clusterresources.com
>>>
>>>
>>> ----- Original Message -----
>>> From: "Victor Gregorio" <vgregorio at penguincomputing.com>
>>> To: torqueusers at supercluster.org
>>> Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada 
>>> Mountain
>>> Subject: [torqueusers] Questions about pbs_server --ha
>>>
>>> Hey folks :)
>>>
>>> I've been lurking about for a bit and finally had a question to 
>>> post.
>>>
>>> So, I am using two systems with pbs_server --ha and a shared NFS 
>>> mount for /var/spool/torque/server_priv.  In my testing, I bring 
>>> down the primary server by pulling the power plug.  Unfortunately, 
>>> the secondary server does not pick up and become the primary 
>>> pbs_server.
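>>>
>>> For reference, each system mounts the share and starts the server
>>> along these lines (the NFS host name is a placeholder):
>>>
>>>     mount nfs-server:/var/spool/torque/server_priv /var/spool/torque/server_priv
>>>     pbs_server --ha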
>>>
>>> Is this because /var/spool/torque/server_priv/server.lock is not 
>>> removed when the primary server has a critical failure?
>>>
>>> So, I tried removing the server.lock file, but the secondary 
>>> pbs_server --ha instance never picks up and becomes primary.  What 
>>> is the trigger to activate a passive pbs_server --ha?
>>>
>>> Any advice is appreciated.
>>>
>>> Regards,
>>>
>>> --
>>> Victor Gregorio
>>> Penguin Computing
>>>

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

