[torquedev] proposed change in directory structure

Garrick Staples garrick at usc.edu
Fri Jul 11 15:06:31 MDT 2008

On Fri, Jul 11, 2008 at 04:57:30PM -0400, Glen Beane alleged:
> On Fri, Jul 11, 2008 at 4:43 PM, Garrick Staples <garrick at usc.edu> wrote:
> > On Fri, Jul 11, 2008 at 04:28:43PM -0400, Glen Beane alleged:
> > > I've been working on some changes in trunk that transfer the .OU and .ER
> > > spool files from pbs_mom back to pbs_server. This is one of the steps we
> > > need to take so that a job in the COMPLETE state can be restarted from a
> > > checkpoint file.  (the files are only returned to the server if
> > > keep_completed is positive and the job has a checkpoint file)
> > >
> > > There are problems when the spool file is shared between pbs_server and
> > > the mother superior pbs_mom. What happens is that when the files are
> > > "returned" pbs_server takes ownership of the .ER and .OU files in the
> > > spool dir and when pbs_mom forks to the user to copy the files back to
> > > the user home directory it is unable to do so because of a permission
> > > denied error.  I feel that the cleanest solution is to just separate the
> > > pbs_server and pbs_mom spool directories.  In my current working copy of
> > > trunk I have changed pbs_server to use server_home/server_spool instead
> > > of server_home/spool.  pbs_mom continues to use server_home/spool.  This
> > > solves my problems because when the spool files are returned to
> > > pbs_server pbs_mom retains its copy in its own spool directory. It is
> > > then free to fork to the user to copy the files and then delete them.
> > >
> > > Are there any objections to this change in trunk? (the change will be
> > > introduced with the release of TORQUE 2.4.0)
> >
> > So we're doing a useless copy from server_home/spool to
> > server_home/server_spool?   At my site, these files are often a significant
> > percentage of the filesystem.  If a file is more than 50% of the total
> > filesystem, then this is going to fail.
> >
> > Why not just have the server check if it already has the file and not
> > issue a copy request?
>
> You probably don't run pbs_mom and pbs_server on the same host, do you?  ;)
> I think 99% of the time the copy from pbs_mom to pbs_server is going to be
> required.
>
> As for this special case, yes, if they can share the same spool directory
> then it would be good so we don't have to do the copy. However, the problem
> is letting pbs_mom know that pbs_server is using the same spool directory.
> pbs_mom assumes it owns the file and will delete it when it is done. If
> pbs_server takes ownership of the file then it either has to make it world
> readable (then anyone can snoop on the contents of the .OU and .ER files
> while the job is in the COMPLETE state) or pbs_mom cannot copy the file
> back to the user directory (permission denied). If it does not take
> ownership of the file then there needs to be some way to keep pbs_mom from
> deleting the file when it is done with it.
>
> I guess we could have a job attribute do_not_delete_spool_files that
> pbs_server could set.  What do you think? Then I would skip the copy but I
> would still have to make pbs_server know it owns the files so it cleans them
> up after the keep_completed time expires.

pbs_server can check if the file already exists; if so, hardlink it to its own
name, never ask pbs_mom to copy the file, and rename it back to the original.

Seems kind of hackish, but possibly less ugly than figuring out the
do_not_delete_spool_files job attr.
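
To make the hardlink idea concrete, here is a rough sketch in Python rather
than TORQUE's C. It is an illustration only, not code from trunk: the function
names, the ".server" suffix, and the "123.OU" file name are all made up for the
example, and a scratch directory stands in for the shared spool directory.

```python
import os
import tempfile


def claim_spool_file(spool_dir, name, suffix=".server"):
    """Return True if the file was claimed locally, False if a copy is needed.

    Hypothetical helper: `suffix` is an invented server-side name.
    """
    original = os.path.join(spool_dir, name)
    claimed = original + suffix
    if not os.path.exists(original):
        # Spool directory is not shared: the server would request the copy.
        return False
    # Second name for the same inode; no file data is copied.
    os.link(original, claimed)
    return True


def restore_spool_file(spool_dir, name, suffix=".server"):
    """Rename the server's link back to the original name."""
    original = os.path.join(spool_dir, name)
    os.rename(original + suffix, original)


# Demonstration in a scratch directory standing in for the shared spool.
with tempfile.TemporaryDirectory() as spool:
    path = os.path.join(spool, "123.OU")
    with open(path, "w") as f:
        f.write("job output\n")

    claimed = claim_spool_file(spool, "123.OU")
    os.remove(path)  # stands in for pbs_mom deleting its file when done
    restore_spool_file(spool, "123.OU")

    with open(path) as f:
        survived = f.read()
```

Because both names point at the same inode, the data survives the delete
without a second copy on disk. The two names also share one owner and mode, so
this sketch does nothing about the ownership question raised above.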

> This brings up another issue.  If there are a lot of checkpointed jobs in
> the COMPLETE state then that means there can be a huge amount of data that
> has to be stored by pbs_server.

Ah well, you can leave that up to the sysadmin.

Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html