Bug 121 - pbs_mom should fork to do per job directory removal
Summary: pbs_mom should fork to do per job directory removal
Product: TORQUE
Version: 2.4.x
Platform: Other Linux
Importance: P5 normal
Assigned To: Ken Nielson
Reported: 2011-04-19 19:44 MDT by Chris Samuel
Modified: 2011-10-13 12:29 MDT (History)

Description Chris Samuel 2011-04-19 19:44:41 MDT
/* Found in 2.4, but seems unchanged in 2.5, 3.0 and trunk */

Currently pbs_mom effectively does (in the main thread of the process):

if (remtree($TMPDIR)) log "Oops, remtree failed"
seteuid(0) /* This is $pbsuser on 2.5 and later */

This means that if remtree() hangs (or takes a long time), pbs_mom itself
becomes completely unresponsive: it cannot accept new connections or clean up
when other existing jobs on the node exit.

Given that the only consequence of remtree() failing is that pbs_mom logs
an error, it should be safe to fork the whole sequence off and exit() at
the end, leaving pbs_mom free to do other work.

There may be other places where this is important, but this is the particular
case that broke us when a filesystem bug caused remtree() to hang on some
nodes. :-(
Comment 1 Ken Nielson 2011-05-05 08:53:34 MDT
Should this be done for all calls to remtree() or just in job_purge?
Comment 2 Chris Samuel 2011-05-10 20:09:55 MDT
Hi Ken, sorry for the delay, flat out here before I go into hospital tomorrow
for a minor op!

I reckon just job_purge for the moment, unless there are other points in the
critical path where it might be necessary?
Comment 3 Ken Nielson 2011-05-20 15:45:38 MDT
Instead of forking, we now create a thread from job_purge in the MOM code.
This is fixed in TORQUE 2.4.13, 2.5.6 and 3.0.2.
Comment 4 Glen 2011-10-13 12:29:57 MDT
Not sure if this is a consequence of doing this in a thread as described in
comment #3, but in TORQUE 2.5.6 we see cases where pbs_mom changes its UID
after cleaning up a tmpdir, and then we get lots of permission denied errors
in the mom log and the node's state becomes "down".