Bugzilla – Bug 121
pbs_mom should fork to do per job directory removal
Last modified: 2011-10-13 12:29:57 MDT
/* Found in 2.4, but seems unchanged in 2.5, 3.0 and trunk */

Currently pbs_mom effectively does (in the main thread of the process):

    setegid($job_group)
    seteuid($job_user)
    if (remtree($TMPDIR))
        log "Oops, remtree failed"
    seteuid(0)   /* This is $pbsuser on 2.5 and later */
    setegid($pbsgroup)

This means that should the remtree() hang (or take a long time), pbs_mom itself becomes completely unresponsive: it cannot accept new connections or clean up when other existing jobs on the node exit.

Given that the only consequence of remtree() failing is that pbs_mom will log an error, it should be safe to fork the whole sequence off and exit() at the end, leaving pbs_mom free to do work.

There may be other places where this is important, but this is the particular case that broke us when a filesystem bug caused remtree() to hang on some nodes. :-(
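For illustration, the fork-and-exit idea proposed above could look roughly like the following. This is only a sketch, not TORQUE code: spawn_tmpdir_cleanup() and remove_tree() are hypothetical names, remove_tree() is an nftw()-based stand-in for the real remtree(), and the job's uid/gid are assumed to be already known to the caller.

```c
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* nftw() callback: remove a single file or (already emptied) directory. */
static int remove_entry(const char *path, const struct stat *sb,
                        int type, struct FTW *ftwbuf)
{
    (void)sb; (void)type; (void)ftwbuf;
    return remove(path);
}

/* Hypothetical stand-in for remtree(): FTW_DEPTH visits a directory's
 * contents before the directory itself, FTW_PHYS does not follow symlinks. */
static int remove_tree(const char *path)
{
    return nftw(path, remove_entry, 16, FTW_DEPTH | FTW_PHYS);
}

/* Hypothetical helper: fork a child that becomes the job user and removes
 * the per-job directory. The parent returns immediately with the child's
 * pid (or -1 if fork() failed), so a hung filesystem only blocks the child. */
pid_t spawn_tmpdir_cleanup(const char *tmpdir, uid_t job_uid, gid_t job_gid)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;              /* parent, or fork() failure (-1) */

    /* Child only: change group first, then user. The IDs never need to be
     * restored because the child exits when it is done. */
    if (setegid(job_gid) != 0 || seteuid(job_uid) != 0)
        _exit(2);

    if (remove_tree(tmpdir) != 0) {
        perror("Oops, remtree failed");
        _exit(1);
    }
    _exit(0);                    /* _exit() skips the parent's atexit handlers */
}
```

The caller would still need to reap the child at some point (for example with waitpid() and WNOHANG from its main loop, or from a SIGCHLD handler) to avoid accumulating zombies. Because the credential changes happen in a separate process, they cannot leak back into the daemon.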
Should this be done for all calls to remtree() or just in job_purge()?
Hi Ken, sorry for the delay, flat out here before I go into hospital tomorrow for a minor op! I reckon just job_purge() for the moment, unless there are other points in the critical path where it might be necessary?
Instead of a fork, we create a thread from job_purge() in the MOM code. This is fixed in TORQUE 2.4.13, 2.5.6 and 3.0.2.
Not sure if this is a consequence of doing this in a thread as described in comment #3, but in TORQUE 2.5.6 we see cases where pbs_mom changes its UID after cleaning up a tmpdir. We then get lots of "permission denied" errors in the mom log and the node's state becomes "down".