[torqueusers] Questions about pbs_server --ha

Victor Gregorio vgregorio at penguincomputing.com
Mon Apr 13 12:37:32 MDT 2009


On Mon, Apr 13, 2009 at 10:59:01AM -0600, Ken Nielson wrote:
> Victor,
> 
> Your observation about torque's --ha option is correct. If the controlling pbs_server simply goes away, the lock on the file will remain in place. You can delete the lock file on the NFS share; a new file will be created by the redundant pbs_server and the failover process will start.
> 

So, for this test I ran both pbs_servers at the same time using --ha.
Then, I pulled the power plug from the primary server.  Next, I removed
the server.lock file from the shared NFS partition mounted at
/var/spool/torque/server_priv.

Unfortunately, the secondary pbs_server running with --ha does not take
over as primary after the server.lock file is removed.

It appears that pbs_server --ha is stuck waiting for an unavailable
resource:

root# ps -aef | grep pbs
root     27633     1  0 03:07 pts/1    00:00:00 /usr/sbin/pbs_server --ha
root     27641     1  0 03:07 ?        00:00:00 /usr/sbin/pbs_sched
root     29540     1  0 03:10 ?        00:00:00 [pbs_mom]
root     30019     1  0 03:10 ?        00:00:00 [pbs_mom]
root     30054 23570  0 03:12 pts/1    00:00:00 grep pbs

root# strace -p 27633
Process 27633 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
fcntl(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})               = 0
fcntl(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})               = 0
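
For what it's worth, that trace reads like a plain once-a-second retry on a
whole-file write lock.  Here is a rough C sketch of what the loop appears to
be doing; this is my reconstruction from the strace, not the actual TORQUE
source, and the lock file path is an assumption:

/* Sketch of the retry loop suggested by the strace above (not TORQUE code).
 * The process asks for an exclusive whole-file write lock without blocking
 * and sleeps one second whenever the request fails with EAGAIN. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* assumed path; the real lock lives in TORQUE's server_priv directory */
    const char *path = "/var/spool/torque/server_priv/server.lock";
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type   = F_WRLCK;   /* exclusive write lock            */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         /* length 0 means "the whole file" */

    while (fcntl(fd, F_SETLK, &fl) == -1) {
        if (errno != EAGAIN && errno != EACCES) {
            perror("fcntl(F_SETLK)");
            return 1;
        }
        sleep(1);            /* same one-second pause as the nanosleep above */
    }

    printf("lock acquired; this is where the secondary would take over\n");
    return 0;
}

If that is what is happening, the secondary should take over the moment
F_SETLK stops failing with EAGAIN.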

root# ll /proc/27633/fd
total 4
lrwx------  1 root root 64 Apr 13 03:10 0 -> /dev/pts/1
l-wx------  1 root root 64 Apr 13 03:10 1 -> pipe:[35681]
l-wx------  1 root root 64 Apr 13 03:07 2 -> pipe:[35682]
l-wx------  1 root root 64 Apr 13 03:10 3 -> /var/spool/torque/server_priv/.nfs00000000005842b000000001

root# rm /var/spool/torque/server_priv/.nfs00000000005842b000000001
rm: cannot remove `/var/spool/torque/server_priv/.nfs00000000005842b000000001': Device or resource busy
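
If I read this right, the .nfs file is NFS silly-rename: server.lock was
unlinked while the secondary still had it open on fd 3, so the client renamed
it instead of deleting it, and it will only disappear once that descriptor is
closed.  In other words, removing server.lock from the share does not release
whatever lock the secondary is blocked on.  To check whether the lock manager
still thinks someone (presumably the dead primary) holds the write lock, a
quick F_GETLK probe along these lines could be run against server.lock before
it is removed, or against the silly-renamed .nfs file (same inode).  The path
below is an assumption:

/* Diagnostic sketch: ask which lock, if any, conflicts with a write lock
 * on the TORQUE server lock file.  Path is assumed, not taken from TORQUE. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/var/spool/torque/server_priv/server.lock";
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type   = F_WRLCK;   /* "would a write lock succeed?" */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;

    if (fcntl(fd, F_GETLK, &fl) == -1) {
        perror("fcntl(F_GETLK)");
        return 1;
    }

    if (fl.l_type == F_UNLCK)
        printf("no conflicting lock; a new pbs_server should be able to grab it\n");
    else
        printf("conflicting %s lock, reported pid %ld (the pid may not be\n"
               "meaningful when the lock is held by another NFS client)\n",
               fl.l_type == F_WRLCK ? "write" : "read", (long)fl.l_pid);

    close(fd);
    return 0;
}

If that reports a conflicting lock even though the primary is powered off, it
would suggest the NFS lock manager has not yet noticed that the client is gone.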

I believe I am using the export and mount options from the Torque documentation:

    * export options: *(rw,sync,no_root_squash)
    * mount options on both pbs_servers: bg,intr,soft,rw
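
Spelled out, that would be roughly the following on the NFS host and in fstab
on both pbs_server machines (the hostname and export path here are
placeholders, not my actual values):

    # /etc/exports on the NFS host
    /export/torque_server_priv    *(rw,sync,no_root_squash)

    # /etc/fstab on each pbs_server
    nfshost:/export/torque_server_priv  /var/spool/torque/server_priv  nfs  bg,intr,soft,rw  0  0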

Thoughts?

-- 
Victor Gregorio
Penguin Computing

> We understand that this is not the ideal way to make torque highly available. We should put this on the list of things to do to help improve torque. 
> 
> If it is possible, you may try using a high-availability OS, that is, redundant systems: if one machine goes down, another machine is able to load and run the image of the failed system. I know this option requires far more resources, but I just wanted to suggest it in case it is something you can do.
> 
> Regards
> 
> Ken Nielson
> Cluster Resources
> knielson at clusterresources.com
> 
> ----- Original Message -----
> From: "Victor Gregorio" <vgregorio at penguincomputing.com>
> To: "Ken Nielson" <knielson at clusterresources.com>
> Cc: torqueusers at supercluster.org
> Sent: Monday, April 13, 2009 10:14:14 AM GMT -07:00 US/Canada Mountain
> Subject: Re: [torqueusers] Questions about pbs_server --ha
> 
> Hello Ken,
> 
> Thanks for the reply.  I have a third system which exports NFS storage
> for both pbs_servers' /var/spool/torque/server_priv.  For now, there is
> no NFS redundancy.
> 
>     * export options: *(rw,sync,no_root_squash)
>     * mount options on both pbs_servers: bg,intr,soft,rw
> 
> -- 
> Victor Gregorio
> Penguin Computing
> 
> On Mon, Apr 13, 2009 at 09:57:41AM -0600, Ken Nielson wrote:
> > Victor,
> > 
> > Tell us about your NFS setup. Where does the physical disk reside, and is it set up to fail over to another system if the primary NFS server fails?
> > 
> > Ken Nielson
> > --------------------
> > Cluster Resources
> > knielson at clusterresources.com
> > 
> > 
> > ----- Original Message -----
> > From: "Victor Gregorio" <vgregorio at penguincomputing.com>
> > To: torqueusers at supercluster.org
> > Sent: Friday, April 10, 2009 2:54:56 PM GMT -07:00 US/Canada Mountain
> > Subject: [torqueusers] Questions about pbs_server --ha
> > 
> > Hey folks :)
> > 
> > I've been lurking about for a bit and finally had a question to post.
> > 
> > So, I am using two systems with pbs_server --ha and a shared NFS mount
> > for /var/spool/torque/server_priv.  In my testing, I bring down the
> > primary server by pulling the power plug.  Unfortunately, the secondary
> > server does not pick up and become the primary pbs_server.
> > 
> > Is this because /var/spool/torque/server_priv/server.lock is not removed
> > when the primary server has a critical failure?
> > 
> > So, I tried removing the server.lock file, but the secondary pbs_server
> > --ha instance never picks up and becomes primary.  What is the trigger
> > to activate a passive pbs_server --ha?
> > 
> > Any advice is appreciated.
> > 
> > Regards,
> > 
> > -- 
> > Victor Gregorio
> > Penguin Computing
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers

