[torqueusers] Re: File system "snapshot" fails when using Torque

Rick.Ingham at expeditors.com Rick.Ingham at expeditors.com
Wed Aug 3 18:19:28 MDT 2005


On Solaris, plock() apparently locks the program files themselves, which in
our case live in /usr/local/sbin.  Since there is some kind of lock on
/usr/local/sbin/pbs_mom, a global write lock on the /usr file system
fails.

That's basically what the Sun bug report is describing.

Whether Sun "should" be doing that or not is an entirely different
question.  Sun says there are a number of related RFEs to address problems
like this relating to file system snapshotting.

--- Rick Ingham, Expeditors Int'l / IS
---- RICK.INGHAM at EXPEDITORS.COM  (206) 674-3400 x3284   FAX  246-3197


Garrick Staples <garrick at usc.edu>
Sent by: torqueusers-bounces at supercluster.org
08/03/2005 05:13 PM
To: torqueusers at supercluster.org
cc:
Subject: Re: [torqueusers] Re: File system "snapshot" fails when using Torque




On Wed, Aug 03, 2005 at 04:30:33PM -0700, Rick.Ingham at expeditors.com
alleged:
> I think my problem stems from --enable-plock-daemons.  Here's a description
> from Sun of why the file system snapshot may be failing.
>
>       xntpd runs in the realtime class, and locks
>       down all of its pages with mlockall.  This includes the xntpd binary
>       and the various libraries it is linked with.  This prevents lockfs
>       from being able to acquire a write lock on the file system --- the
>       check in ufs_thaw_wlock fails:

I'm confused.  My Solaris and Linux mlock/mlockall manpages state that they
"disable paging" (implies to me both swapping out dirty pages and releasing
clean code/data pages).

I don't see how it would prevent a write lock on the fs.  I don't see how it
has anything to do with writing to the filesystem.

Someone please correct me.  I'm curious about this.

Btw, torque doesn't use mlock/mlockall; it uses plock().


> I'm going to try using --disable-plock-daemons.
>
> By the way, what is the default value?  Disabled?  Enabled=0?

The configure script definitely defaults to disabled:

# Check whether --enable-plock_daemons or --disable-plock_daemons was given.
if test "${enable_plock_daemons+set}" = set; then
  enableval="$enable_plock_daemons"
  case "${enableval}" in
  yes) PLOCK_DAEMONS=7 ;;
  no)  PLOCK_DAEMONS=0 ;;
  *) PLOCK_DAEMONS="${enableval}" ;;
esac
else
  PLOCK_DAEMONS=0
fi
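
To make the mapping concrete, here is a small standalone sketch of the same
case logic.  The function name resolve_plock_daemons is hypothetical and is
not part of torque or its configure script; it only mirrors the excerpt
above.  It relies on standard autoconf behaviour: --enable-FOO sets
enable_FOO=yes, --disable-FOO sets enable_FOO=no, --enable-FOO=ARG sets
enable_FOO=ARG, and dashes in the feature name become underscores.

```shell
#!/bin/sh
# Hypothetical helper (not from torque): mirrors the configure excerpt above.
# Called with no argument  -> option was not given on the command line.
# Called with one argument -> the value autoconf would store in enableval.
resolve_plock_daemons() {
  if [ "$#" -gt 0 ]; then
    case "$1" in
      yes) echo 7 ;;      # --enable-plock-daemons
      no)  echo 0 ;;      # --disable-plock-daemons
      *)   echo "$1" ;;   # --enable-plock-daemons=VALUE is passed through
    esac
  else
    echo 0                # option absent: disabled
  fi
}

resolve_plock_daemons        # prints 0
resolve_plock_daemons yes    # prints 7
resolve_plock_daemons no     # prints 0
resolve_plock_daemons 3      # prints 3
```

So a plain ./configure and an explicit --disable-plock-daemons should both
end up with PLOCK_DAEMONS=0, assuming the excerpt above is the only place the
value is set.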



> --- Rick Ingham, Expeditors Int'l / IS
> ---- RICK.INGHAM at EXPEDITORS.COM  (206) 674-3400 x3284   FAX  246-3197
>
>
>

> Rick Ingham/IS/Expeditors
> 07/29/2005 09:55 AM
> To: torqueusers at supercluster.org
> cc:
> Subject: File system "snapshot" fails when using Torque
>
>
> We have been using OpenPBS on 100+ servers (standalone systems with server,
> scheduler, and mom) for many months on Sun Solaris 9 SPARC systems.  Last
> week we deployed Torque 1.2.0p5 to five servers.   Since then, the file
> system snapshot (snapfs) of the /usr mount point has been failing with
> every daily backup.
>
> Our PBS programs are installed in:
>       /usr/local/sbin
>       /usr/local/bin
>
> PBSHOME is:
>       /var/spool/PBS
>
> I have not been able to explain why the snapshot is getting wrapped around
> the axle.  'lsof' shows nothing on /usr that the pbs_* daemons have open or
> locked.
>
> Any ideas?  I'm stumped.
>
> --- Rick Ingham, Expeditors Int'l / IS
> ---- RICK.INGHAM at EXPEDITORS.COM  (206) 674-3400 x3284   FAX  246-3197
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Garrick Staples, Linux/HPCC Administrator
University of Southern California