[torquedev] [Bug 218] New: Jobs getting stuck in exiting "job recycled into exiting on SIGNULL/KILL"

bugzilla-daemon at supercluster.org
Sun Sep 30 23:08:47 MDT 2012


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=218

           Summary: Jobs getting stuck in exiting "job recycled into
                    exiting on SIGNULL/KILL"
           Product: TORQUE
           Version: 2.4.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: chris at csamuel.org
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


We are running 2.4, but I note that this code is essentially unchanged through
to 4.1.x.

On one of our clusters we are seeing bursts of jobs that end up stuck in an
EXITING state on various nodes for no apparent reason.   For this month I see:

[root at merri-m ~]# xdsh compute -v 'fgrep -h "job recycled into exiting" /var/spool/torque/mom_logs/201209*' | \
    awk -F\; '{print $NF}' | sort | uniq -c
     10 job recycled into exiting on SIGNULL/KILL from substate 1
      9 job recycled into exiting on SIGNULL/KILL from substate 40
     45 job recycled into exiting on SIGNULL/KILL from substate 42
     50 job recycled into exiting on SIGNULL/KILL from substate 50
      6 job recycled into exiting on SIGNULL/KILL from substate 53
     76 job recycled into exiting on SIGNULL/KILL from substate 57
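
For anyone who wants the same tally without the awk pipeline, here is a
minimal Python sketch. It assumes, as the awk command above does, that the
message is the last semicolon-separated field of each mom log line. The
function name and the sample lines are hypothetical and are not taken from a
real log.

```python
import re
from collections import Counter

# Message text as it appears in the counts above.
PATTERN = re.compile(
    r"job recycled into exiting on SIGNULL/KILL from substate (\d+)"
)

def tally_substates(lines):
    """Count 'job recycled into exiting' messages per substate number."""
    counts = Counter()
    for line in lines:
        # Same idea as awk -F\; '{print $NF}': keep the last ';' field.
        message = line.rstrip("\n").split(";")[-1]
        match = PATTERN.search(message)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Hypothetical sample lines; only the final field matters here.
sample = [
    "stamp;0008;pbs_mom;Job;101.example;job recycled into exiting on SIGNULL/KILL from substate 42\n",
    "stamp;0008;pbs_mom;Job;102.example;job recycled into exiting on SIGNULL/KILL from substate 57\n",
    "stamp;0008;pbs_mom;Job;103.example;job recycled into exiting on SIGNULL/KILL from substate 42\n",
    "stamp;0002;pbs_mom;Svr;pbs_mom;some unrelated message\n",
]

for substate, count in sorted(tally_substates(sample).items()):
    print(f"{count:7d} substate {substate}")
```

To run it against real logs, pass an open file (or several chained together)
in place of the sample list.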

These messages are logged at line 2199 of
branches/2.4-fixes/src/resmom/requests.c and are triggered when signal 0 or 9
is sent to a task (to check whether it exists or to kill it off,
respectively).
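
As background on the two signals: on POSIX systems, sending signal 0 delivers
nothing and only reports whether the target process exists, while signal 9
(SIGKILL on Linux) terminates it. The sketch below illustrates the signal 0
probe in Python. It is not the pbs_mom code, and the helper name is
hypothetical.

```python
import os

def process_exists(pid):
    """Probe a PID with signal 0, which performs error checking only."""
    try:
        os.kill(pid, 0)  # nothing is delivered to the process
    except ProcessLookupError:
        return False     # ESRCH: no such process
    except PermissionError:
        return True      # EPERM: it exists but belongs to another user
    return True

# Our own PID always exists, so this prints True.
print(process_exists(os.getpid()))
```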

The only workaround we have found so far is to run either momctl -c for the
job on its mother superior or qdel -p for the job.   It would be nicer if
pbs_mom could handle this case itself.
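
For reference, the two workarounds can be written out as argument lists. This
is a sketch only: the helper name, the job ID and the host name are
hypothetical, and it assumes momctl is pointed at the mother superior with
-h rather than being run on that node directly. It only builds the commands
and does not execute anything.

```python
def workaround_commands(job_id, mother_superior):
    """Build, without running, the two cleanup commands described above."""
    return {
        # Clear the stale job on the mother superior's pbs_mom.
        "momctl": ["momctl", "-h", mother_superior, "-c", job_id],
        # Purge the job from the server side.
        "qdel": ["qdel", "-p", job_id],
    }

# Hypothetical job ID and node name, for illustration only.
for name, argv in workaround_commands("1234.example-server", "node01").items():
    print(name, "->", " ".join(argv))
```

Either list can be handed to subprocess.run once it has been checked by hand.
Only one of the two should be needed for a given job.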


