[torquedev] proposed change in directory structure

Glen Beane glen.beane at gmail.com
Fri Jul 11 15:25:56 MDT 2008


On Fri, Jul 11, 2008 at 5:06 PM, Garrick Staples <garrick at usc.edu> wrote:

> On Fri, Jul 11, 2008 at 04:57:30PM -0400, Glen Beane alleged:
> > On Fri, Jul 11, 2008 at 4:43 PM, Garrick Staples <garrick at usc.edu> wrote:
> >
> > > On Fri, Jul 11, 2008 at 04:28:43PM -0400, Glen Beane alleged:
> > > > I've been working on some changes in trunk that transfer the .OU and .ER
> > > > spool files from pbs_mom back to pbs_server. This is one of the steps we
> > > > need to take so that a job in the COMPLETE state can be restarted from a
> > > > checkpoint file. (The files are only returned to the server if
> > > > keep_completed is positive and the job has a checkpoint file.)
> > > >
> > > > There are problems when the spool directory is shared between pbs_server
> > > > and the mother superior pbs_mom. What happens is that when the files are
> > > > "returned", pbs_server takes ownership of the .ER and .OU files in the
> > > > spool dir, and when pbs_mom forks to the user to copy the files back to
> > > > the user's home directory, it is unable to do so because of a permission
> > > > denied error. I feel that the cleanest solution is to just separate the
> > > > pbs_server and pbs_mom spool directories. In my current working copy of
> > > > trunk I have changed pbs_server to use server_home/server_spool instead
> > > > of server_home/spool. pbs_mom continues to use server_home/spool. This
> > > > solves my problems because when the spool files are returned to
> > > > pbs_server, pbs_mom retains its copy in its own spool directory. It is
> > > > then free to fork to the user to copy the files and then delete them.
> > > >
> > > > Are there any objections to this change in trunk? (The change will be
> > > > introduced with the release of TORQUE 2.4.0.)
> > >
> > > So we're doing a useless copy from server_home/spool to
> > > server_home/server_spool? At my site, these files are often a significant
> > > percentage of the filesystem. If a file is more than 50% of the total
> > > filesystem, then this is going to fail.
> > >
> > > Why not just have the server check if it already has the file and not
> > > issue a copy request?
> >
> >
> > You probably don't run pbs_mom and pbs_server on the same host, do you? ;)
> > I think 99% of the time the copy from pbs_mom to pbs_server is going to be
> > required.
> >
> > As for this special case, yes, if they can share the same spool directory
> > then it would be good so we don't have to do the copy. However, the problem
> > is letting pbs_mom know that pbs_server is using the same spool directory.
> > pbs_mom assumes it owns the file and will delete it when it is done. If
> > pbs_server takes ownership of the file then it either has to make it world
> > readable (then anyone can snoop on the contents of the .OU and .ER files
> > while the job is in the COMPLETE state) or pbs_mom cannot copy the file
> > back to the user directory (permission denied). If it does not take
> > ownership of the file then there needs to be some way to keep pbs_mom from
> > deleting the file when it is done with it.
> >
> > I guess we could have a job attribute do_not_delete_spool_files that
> > pbs_server could set. What do you think? Then I would skip the copy but I
> > would still have to make pbs_server know it owns the files so it cleans
> > them up after the keep_completed time expires.
>
> pbs_server can check if the file already exists, if so, hardlink it to its
> own name, never ask pbs_mom to copy the file, and rename it back to the
> original name?
>
> Seems kind of hackish, but possibly less ugly than figuring out the
> do_not_delete_spool_files job attr.



The only problem I can think of is the case where the file exists not
because the pbs_mom is local, but because the job already completed once,
was restarted from a checkpoint, and has now completed again. pbs_server
would see that the file was already in the spool directory, assume it was
created by a local mom, and therefore not request a new copy of the file.
If the job were then restarted again from yet another checkpoint file, an
out-of-date .OU or .ER spool file would be staged back out for the job.
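
To make the stale-file case concrete, here is a rough sketch in Python
(not TORQUE code). It assumes the server's only test is "does the spool
file already exist?". The function names, the job id, and the
"<jobid>.OU" file naming are all made up for illustration.

```python
import os
import tempfile


def server_would_skip_copy(spool_dir, job_id, suffix):
    """Hypothetical existence-only check: True means the server assumes a
    local mom wrote the file and never asks pbs_mom for a copy."""
    return os.path.exists(os.path.join(spool_dir, job_id + suffix))


def simulate_two_completions():
    """Walk one job through two completions with a remote (non-local) mom."""
    decisions = []
    with tempfile.TemporaryDirectory() as server_spool:
        job_id = "42.example"  # made-up job id

        # First completion: nothing in the server spool yet, so the check
        # says "request a copy", which is correct.
        decisions.append(server_would_skip_copy(server_spool, job_id, ".OU"))

        # The copy from the remote mom arrives and is kept by the server.
        with open(os.path.join(server_spool, job_id + ".OU"), "w") as f:
            f.write("output from run 1\n")

        # The job is restarted from a checkpoint and completes again.
        # The file on the server is still the one from run 1, but the
        # existence check cannot tell, so the fresh copy is skipped.
        decisions.append(server_would_skip_copy(server_spool, job_id, ".OU"))
    return decisions
```

The second decision is the bad one: the server keeps the run 1 output and
would stage that out on the next restart. An existence check alone can't
distinguish "a local mom just wrote this" from "this is left over from an
earlier completion".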