[torquedev] proposed change in directory structure

Glen Beane glen.beane at gmail.com
Fri Jul 11 14:59:41 MDT 2008

On Fri, Jul 11, 2008 at 4:57 PM, Glen Beane <glen.beane at gmail.com> wrote:

> On Fri, Jul 11, 2008 at 4:43 PM, Garrick Staples <garrick at usc.edu> wrote:
>> On Fri, Jul 11, 2008 at 04:28:43PM -0400, Glen Beane alleged:
>> > I've been working on some changes in trunk that transfer the .OU and .ER
>> > spool files from pbs_mom back to pbs_server. This is one of the steps we
>> > need to take so that a job in the COMPLETE state can be restarted from a
>> > checkpoint file.  (the files are only returned to the server if
>> > keep_completed is positive and the job has a checkpoint file)
>> >
>> > There are problems when the spool file is shared between pbs_server and the
>> > mother superior pbs_mom. What happens is that when the files are "returned"
>> > pbs_server takes ownership of the .ER and .OU files in the spool dir and
>> > when pbs_mom forks to the user to copy the files back to the user home
>> > directory they are unable to do so because of a permission denied error. I
>> > feel that the cleanest solution is to just separate the pbs_server and
>> > pbs_mom spool directories.  In my current working copy of trunk I have
>> > changed pbs_server to use server_home/server_spool instead of
>> > server_home/spool.  pbs_mom continues to use server_home/spool.  This solves
>> > my problems because when the spool files are returned to pbs_server pbs_mom
>> > retains its copy in its own spool directory. It is then free to fork to the
>> > user to copy the files and then delete them.
>> >
>> > Are there any objections to this change in trunk? (the change will be
>> > introduced with the release of TORQUE 2.4.0)
>>
>> So we're doing a useless copy from server_home/spool to
>> server_home/server_spool?   At my site, these files are often a significant
>> percentage of the filesystem.  If a file is more than 50% of the total
>> filesystem, then this is going to fail.
>>
>> Why not just have the server check if it already has the file and not issue
>> a copy request?
>
> You probably don't run pbs_mom and pbs_server on the same host do you?  ;)
> I think 99% of the time the copy from pbs_mom to pbs_server is going to be
> required.
>
> As for this special case, yes if they can share the same spool directory
> then it would be good so we don't have to do the copy, however the problem
> is letting pbs_mom know that pbs_server is using the same spool directory.
> pbs_mom assumes it owns the file and will delete it when it is done. If
> pbs_server takes ownership of the file then it either has to make it world
> readable (then anyone can snoop on the contents of the .OU and .ER files
> while the job is in the COMPLETE state) or pbs_mom cannot copy the file
> back to the user directory (permission denied). If it does not take
> ownership of the file then there needs to be some way to keep pbs_mom from
> deleting the file when it is done with it.
>
> I guess we could have a job attribute do_not_delete_spool_files that
> pbs_server could set.  What do you think? Then I would skip the copy but I
> would still have to make pbs_server know it owns the files so it cleans them
> up after the keep_completed time expires.
>
> This brings up another issue.  If there are a lot of checkpointed jobs in
> the COMPLETE state then that means there can be a huge amount of data that
> has to be stored by pbs_server.

So you've helped convince me that the easy way out is a bad idea (is it
ever?).  Thanks a lot Garrick. ;)
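
For reference, here is a rough sketch of the check Garrick suggested (have
the server see whether it already has the file before it asks for a copy).
This is Python for illustration only, not TORQUE code: the function name
server_needs_copy and both path arguments are made up, and it assumes the
server can stat the path pbs_mom reports.

```python
import os


def server_needs_copy(mom_spool_file, server_spool_file):
    """Return True if the server should request a copy of the spool file.

    Hypothetical helper: both arguments are paths as seen from the server.
    """
    try:
        # samefile compares device and inode numbers, so it also catches
        # the same directory reached through a symlink or another mount path.
        return not os.path.samefile(mom_spool_file, server_spool_file)
    except FileNotFoundError:
        # At least one path is not visible from here, so the spool is not
        # shared (or the file is gone) and the copy cannot be skipped.
        return True
```

This only answers the "skip the copy" half of the problem. It says nothing
about who owns the file afterwards or how to keep pbs_mom from deleting it,
which is the part that still needs something like do_not_delete_spool_files.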