[torquedev] [Bug 121] New: pbs_mom should fork to do per job directory removal

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Tue Apr 19 19:44:42 MDT 2011


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=121

           Summary: pbs_mom should fork to do per job directory removal
           Product: TORQUE
           Version: 2.4.x
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: chris at csamuel.org
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


/* Found in 2.4, but seems unchanged in 2.5, 3.0 and trunk */

Currently pbs_mom effectively does (in the main thread of the process):

setegid($job_group)
seteuid($job_user)
if (remtree($TMPDIR)) log "Oops, remtree failed"
seteuid(0) /* This is $pbsuser on 2.5 and later */
setegid($pbsgroup)

This means that, should remtree() hang (or take a long time), pbs_mom itself
becomes completely unresponsive: it cannot accept new connections or clean up
when other existing jobs on the node exit.

Given that the only consequence of remtree() failing is that pbs_mom
logs an error, it should be safe to fork the whole sequence off and
exit() at the end, leaving pbs_mom free to do other work.
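
As a rough illustration of the idea (not a patch against the TORQUE source),
the forked variant could look something like the sketch below. Every name in
it is hypothetical: remove_tree() is a stand-in for pbs_mom's remtree(),
remove_job_dir_async() and poll_removal() do not exist in TORQUE, and real
code would use pbs_mom's own logging and child-reaping machinery instead of
stderr and waitpid().

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for remtree(): recursively delete path. */
static int remove_tree(const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode))
        return unlink(path);          /* files and symlinks */

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int rc = 0;
    struct dirent *ent;
    while (rc == 0 && (ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        char child[4096];
        int n = snprintf(child, sizeof child, "%s/%s", path, ent->d_name);
        if (n < 0 || (size_t)n >= sizeof child)
            rc = -1;                  /* path too long for the buffer */
        else
            rc = remove_tree(child);
    }
    closedir(dir);
    return rc == 0 ? rmdir(path) : -1;
}

/*
 * Fork a child that switches to the job owner's credentials, removes the
 * tree and exits. The parent returns immediately with the child's pid
 * (or -1 if fork() failed), so a hung removal only blocks the child.
 */
pid_t remove_job_dir_async(const char *tmpdir, uid_t job_uid, gid_t job_gid)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;                   /* parent, or fork() failure */

    /* Child: group first, then user, as in the sequence above. */
    if (setegid(job_gid) != 0 || seteuid(job_uid) != 0)
        _exit(2);
    if (remove_tree(tmpdir) != 0) {
        fprintf(stderr, "Oops, remtree failed for %s\n", tmpdir);
        _exit(1);
    }
    _exit(0);                         /* skip the parent's atexit handlers */
}

/*
 * Non-blocking check on a removal child: returns 1 once it has exited
 * (storing success or failure in *ok), 0 while it is still running,
 * and -1 on error.
 */
int poll_removal(pid_t pid, int *ok)
{
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == 0)
        return 0;
    if (r < 0)
        return -1;
    *ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return 1;
}
```

Because the credentials are changed only in the child, the parent never has
to switch back, and the child is expected to be reaped at some point (here
via poll_removal()) so that it does not linger as a zombie.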

There may be other places where this is important, but this is the particular
case that broke us when a filesystem bug caused remtree() to hang on some
nodes. :-(


