robbins at physik.uni-wuppertal.de
Thu Dec 8 03:30:53 MST 2005
Thanks for all the responses.
On 7. Dec 2005, at 10:05 PM, Garrick Staples wrote:
> On Wed, Dec 07, 2005 at 03:53:59PM +0100, Simon Robbins alleged:
>> Hi Garrick,
>> I don't like the environment variable name "TMPDIR"
>> used in torque 2. I attach the patches that I think
>> are needed to change the name of this to PBS_TMPDIR,
>> this seems to work on AMD Opterons.
>> Until torque2 I was using an old patch and changing
>> the name of this variable to PBS_TMPDIR.
>> Unfortunately I have just had a user who had until
>> now been innocently using $TMPDIR; torque deleted
>> the contents of this directory when the job finished :-(
>> What do people think about re-naming this variable?
> So the user is supplying their own TMPDIR variable to the job, set to a
> directory that doesn't exist, TORQUE created the directory, and the user
> was surprised when TORQUE deleted it after the job?
The variable "$tmpdir" is defined in the mom config:
where /data is a disk local to the node, this works great :-)
The user was using a set of scripts that used the environment
variable $TMPDIR and the qsub -v option:
qsub ... -v "TMPDIR=/some/directory" ...
Torque then interpreted this as a cue to use "$TMPDIR" as the
transient tmpdir for the job and deleted its contents when the job
finished. In the scripts, $TMPDIR was meant as the ROOT directory
under which the temporary directory should be created (e.g. /tmp).
Anyway, the point is the user was not aware of this new behaviour and
was surprised when their scripts suddenly stopped working :-)
My suggestion is to rename the variable responsible for this
behaviour to "PBS_TMPDIR". I use the qsub wrapper script to define
$TMPDIR to be the same as PBS_TMPDIR inside all jobs. Only if the
user passes qsub -v "PBS_TMPDIR=/some/dir" will Torque use that as
the transient tmpdir and ignore TMPDIR. I don't want to change
the behaviour, just the name.
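
A minimal sketch of the wrapper idea, not the actual wrapper script: it assumes the proposed PBS_TMPDIR variable exists (it is the name suggested in this message, not something stock torque sets), and the helper name job_tmpdir is made up for illustration. Inside the job, TMPDIR follows PBS_TMPDIR when that is set and otherwise keeps whatever the user supplied, falling back to /tmp.

```shell
# Hypothetical helper: decide which directory the job should see as TMPDIR.
# PBS_TMPDIR is the variable name proposed in this thread (an assumption).
job_tmpdir() {
    if [ -n "${PBS_TMPDIR:-}" ]; then
        # Transient directory managed (and later deleted) by the batch system.
        printf '%s\n' "$PBS_TMPDIR"
    else
        # No transient directory: leave the user's own TMPDIR alone.
        printf '%s\n' "${TMPDIR:-/tmp}"
    fi
}

# What the wrapper would run at the start of every job.
TMPDIR="$(job_tmpdir)"
export TMPDIR
```

With this split, a user's "-v TMPDIR=/some/directory" only changes where their scripts write, and deletion would be tied to PBS_TMPDIR alone.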
P.S. I also see problems for removal of links e.g.:
> cd $TMPDIR
> mkdir a_directory
> ln -s a_directory link_to_dir
When torque removes this recursively, it removes "a_directory" first
(alphabetically) and then can't remove the link "link_to_dir"
because the link no longer points to anything (mom_log):
12/08/2005 10:43:06;0080; pbs_mom;Job;9428.whiterabbit.alicenext;Removing transient job directory /data/
12/08/2005 10:43:06;0001; pbs_mom;Svr;pbs_mom;No such file or directory (2) in remtree, stat
12/08/2005 10:43:06;0001; pbs_mom;Svr;pbs_mom;Directory not empty (39) in remtree, rmdir failed on /data/9428.whiterabbit.alicenext
12/08/2005 10:43:06;0001; pbs_mom;Svr;pbs_mom;Inappropriate ioctl for device (25) in recursive (r)rmdir, recursive remove of job transient tmpdir /data/9428.whiterabbit.alicenext failed
This didn't happen with the old transient-tmpdir patch :-(
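
The dangling-link situation can be reproduced outside torque with a short sketch. It runs in a scratch directory from mktemp rather than a real job directory, and the variable names work and link_state are only for illustration. The "-e" test follows the link, much like the stat call named in the log, while "-L" looks at the link itself, as lstat would. Whether remtree should use lstat is a guess from the log, not something checked against the source.

```shell
# Recreate the layout from the message in a throwaway directory.
work="$(mktemp -d)"
cd "$work" || exit 1
mkdir a_directory
ln -s a_directory link_to_dir

# Remove the target first, as the alphabetical walk does.
rmdir a_directory

# -L inspects the link itself; -e follows it and fails once the target is gone.
if [ -L link_to_dir ] && [ ! -e link_to_dir ]; then
    link_state="dangling"
else
    link_state="ok"
fi
echo "link_to_dir: $link_state"

# Removing the link itself still works, after which the directory is empty.
rm link_to_dir
cd / && rmdir "$work"
```

A removal routine that treats a link as a plain entry to unlink, without following it, would get through this layout.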