Bugzilla – Bug 121
pbs_mom should fork to do per job directory removal
Last modified: 2011-10-13 12:29:57 MDT
/* Found in 2.4, but seems unchanged in 2.5, 3.0 and trunk */
Currently pbs_mom effectively does (in the main thread of the process):
if (remtree($TMPDIR)) log "Oops, remtree failed"
seteuid(0) /* This is $pbsuser on 2.5 and later */
This means that if remtree() hangs (or takes a long time), pbs_mom itself
becomes completely unresponsive: it cannot accept new connections or clean up
when other existing jobs on the node exit.
Given that the only consequence of remtree() failing is that pbs_mom logs an
error, it should be safe to fork the whole sequence off and exit() at the end,
leaving pbs_mom free to do other work.
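
A minimal sketch of the fork approach suggested above, not TORQUE's actual
code. The names remove_tree() and remove_tree_in_child() are hypothetical;
remove_tree() is a stand-in for remtree() built on POSIX nftw(). The seteuid()
calls from the original sequence are left out so the sketch runs without root;
in pbs_mom they would sit inside the child, around the removal.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* nftw() callback: remove() handles both files and (empty) directories. */
static int remove_entry(const char *path, const struct stat *sb,
                        int typeflag, struct FTW *ftwbuf)
{
    (void)sb; (void)typeflag; (void)ftwbuf;
    return remove(path);
}

/* Hypothetical stand-in for remtree(): depth-first recursive removal. */
static int remove_tree(const char *path)
{
    /* FTW_DEPTH visits a directory's contents before the directory itself;
     * FTW_PHYS avoids following symbolic links. */
    return nftw(path, remove_entry, 16, FTW_DEPTH | FTW_PHYS);
}

/* Fork a child that removes `path` and exits. The parent returns at once
 * with the child's pid (or -1 if fork() failed), so a hung removal blocks
 * only the child. */
pid_t remove_tree_in_child(const char *path)
{
    assert(path != NULL);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: the only consequence of failure is a logged error. */
        if (remove_tree(path) != 0) {
            perror("remove_tree");
            _exit(EXIT_FAILURE);
        }
        _exit(EXIT_SUCCESS);
    }
    return pid; /* Parent: child pid, or -1 on fork() failure. */
}
```

The parent still has to reap the child at some point, for example with
waitpid(pid, &status, WNOHANG) from its main loop or from a SIGCHLD handler,
otherwise finished children linger as zombies.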
There may be other places where this is important, but this is the particular
case that broke us when a filesystem bug caused remtree() to hang on some
Should this be done for all calls to remtree() or just in job_purge?
Hi Ken, sorry for the delay, flat out here before I go into hospital tomorrow
for a minor op!
I reckon just job_purge for the moment, unless there are other points in the
critical path where it might be necessary?
Instead of forking, we now create a thread from job_purge in the MOM code.
This is fixed in TORQUE 2.4.13, 2.5.6 and 3.0.2.
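
For comparison, a rough sketch of what handing the removal to a thread could
look like. This is an illustration under assumptions, not the code that went
into TORQUE: remove_tree_worker() and remove_tree_in_thread() are hypothetical
names, and nftw() again stands in for remtree().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <ftw.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* nftw() callback: remove() handles both files and (empty) directories. */
static int remove_entry(const char *path, const struct stat *sb,
                        int typeflag, struct FTW *ftwbuf)
{
    (void)sb; (void)typeflag; (void)ftwbuf;
    return remove(path);
}

/* Thread body: owns its private copy of the path and frees it when done. */
static void *remove_tree_worker(void *arg)
{
    char *path = arg;
    if (nftw(path, remove_entry, 16, FTW_DEPTH | FTW_PHYS) != 0)
        perror("remove_tree_worker"); /* failure is only logged */
    free(path);
    return NULL;
}

/* Start a worker thread that removes `path`. Returns 0 on success or an
 * error number. The path is copied because the caller's buffer may be
 * reused or freed before the worker gets to run. */
int remove_tree_in_thread(const char *path, pthread_t *tid)
{
    assert(path != NULL && tid != NULL);

    char *copy = strdup(path);
    if (copy == NULL)
        return ENOMEM;

    int rc = pthread_create(tid, NULL, remove_tree_worker, copy);
    if (rc != 0)
        free(copy); /* the worker never started, so release the copy here */
    return rc;
}
```

The caller either joins the thread later with pthread_join() or calls
pthread_detach() on it. The sketch deliberately leaves out any seteuid()
call: POSIX treats user IDs as process-wide, so changing them from a worker
thread would also change them for the main thread.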
Not sure if this is a consequence of doing this in a thread as described in
comment #3, but in TORQUE 2.5.6 we see cases where pbs_mom changes its UID
after cleaning up a tmpdir. We then get lots of permission denied errors in
the mom log and the node's state becomes "down".