[torquedev] proposed change in directory structure

Glen Beane glen.beane at gmail.com
Fri Jul 11 14:57:30 MDT 2008

On Fri, Jul 11, 2008 at 4:43 PM, Garrick Staples <garrick at usc.edu> wrote:

> On Fri, Jul 11, 2008 at 04:28:43PM -0400, Glen Beane alleged:
> > I've been working on some changes in trunk that transfer the .OU and .ER
> > spool files from pbs_mom back to pbs_server. This is one of the steps we
> > need to take so that a job in the COMPLETE state can be restarted from a
> > checkpoint file.  (the files are only returned to the server if
> > keep_completed is positive and the job has a checkpoint file)
> >
> > There are problems when the spool directory is shared between pbs_server
> > and the mother superior pbs_mom. What happens is that when the files are
> > "returned", pbs_server takes ownership of the .ER and .OU files in the
> > spool dir, and when pbs_mom forks to the user to copy the files back to
> > the user's home directory it is unable to do so because of a permission
> > denied error. I feel that the cleanest solution is to just separate the
> > pbs_server and pbs_mom spool directories.  In my current working copy of
> > trunk I have changed pbs_server to use server_home/server_spool instead
> > of server_home/spool.  pbs_mom continues to use server_home/spool.  This
> > solves my problems because when the spool files are returned to
> > pbs_server, pbs_mom retains its copy in its own spool directory. It is
> > then free to fork to the user to copy the files and then delete them.
> >
> > Are there any objections to this change in trunk? (the change will be
> > introduced with the release of TORQUE 2.4.0)
>
> So we're doing a useless copy from server_home/spool to
> server_home/server_spool?  At my site, these files are often a significant
> percentage of the filesystem.  If a file is more than 50% of the total
> filesystem, then this is going to fail.
>
> Why not just have the server check if it already has the file and not issue
> a copy request?

You probably don't run pbs_mom and pbs_server on the same host, do you?  ;)
I think 99% of the time the copy from pbs_mom to pbs_server is going to be

As for this special case: yes, if they share the same spool directory then
it would be good to skip the copy. The problem is letting pbs_mom know that
pbs_server is using the same spool directory. pbs_mom assumes it owns the
file and will delete it when it is done. If pbs_server takes ownership of
the file, then either it has to make the file world readable (so anyone can
snoop on the contents of the .OU and .ER files while the job is in the
COMPLETE state) or pbs_mom cannot copy the file back to the user's directory
(permission denied). If pbs_server does not take ownership of the file, then
there needs to be some way to keep pbs_mom from deleting the file when it is
done with it.

I guess we could have a job attribute, do_not_delete_spool_files, that
pbs_server could set.  What do you think? Then I would skip the copy, but I
would still have to make pbs_server know it owns the files so that it cleans
them up after the keep_completed time expires.
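
To make the idea concrete, here is a rough sketch of the control flow, written
in Python rather than TORQUE's C and not taken from trunk. Everything in it is
an assumption for illustration: the job is just a dict, the function names
(spool_is_shared, return_spool_file, mom_cleanup, server_purge) are made up,
and do_not_delete_spool_files is only the attribute proposed above, not
something that exists today. It also ignores the ownership and permission
side of the problem.

```python
import os
import shutil


def spool_is_shared(server_spool, mom_spool):
    """Return True when both paths name the same directory on disk."""
    # os.path.samefile compares device and inode numbers, so symlinks or
    # differently spelled paths to one directory are still detected.
    return os.path.samefile(server_spool, mom_spool)


def return_spool_file(job, name, server_spool, mom_spool):
    """Server side: take custody of a spool file, copying only if needed."""
    src = os.path.join(mom_spool, name)
    if spool_is_shared(server_spool, mom_spool):
        # Hypothetical attribute: tells pbs_mom to leave the file alone.
        job["do_not_delete_spool_files"] = True
        return src
    dst = os.path.join(server_spool, name)
    shutil.copyfile(src, dst)
    job["do_not_delete_spool_files"] = False
    return dst


def mom_cleanup(job, name, mom_spool):
    """Mom side: delete the spool file unless the server has claimed it."""
    if job.get("do_not_delete_spool_files"):
        return False
    os.remove(os.path.join(mom_spool, name))
    return True


def server_purge(path):
    """Server side: remove the file once the keep_completed time expires."""
    if os.path.exists(path):
        os.remove(path)
```

With a shared spool directory the copy is skipped and the delete is left to
the server. With separate directories both daemons clean up their own copy.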

This brings up another issue.  If there are a lot of checkpointed jobs in
the COMPLETE state then that means there can be a huge amount of data that
has to be stored by pbs_server.