[torquedev] [Bug 218] New: Jobs getting stuck in exiting "job recycled into exiting on SIGNULL/KILL"
bugzilla-daemon at supercluster.org
Sun Sep 30 23:08:47 MDT 2012
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=218
Summary: Jobs getting stuck in exiting "job recycled into
exiting on SIGNULL/KILL"
Product: TORQUE
Version: 2.4.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P5
Component: pbs_mom
AssignedTo: knielson at adaptivecomputing.com
ReportedBy: chris at csamuel.org
CC: torquedev at supercluster.org
Estimated Hours: 0.0
We are running 2.4, but I note that this code is essentially unchanged through
to 4.1.x.
On one of our clusters we are seeing bursts of jobs that end up stuck in an
EXITING state on various nodes for no apparent reason. For this month I see:
[root at merri-m ~]# xdsh compute -v 'fgrep -h "job recycled into exiting" /var/spool/torque/mom_logs/201209*' \
    | awk -F\; '{print $NF}' | sort | uniq -c
10 job recycled into exiting on SIGNULL/KILL from substate 1
9 job recycled into exiting on SIGNULL/KILL from substate 40
45 job recycled into exiting on SIGNULL/KILL from substate 42
50 job recycled into exiting on SIGNULL/KILL from substate 50
6 job recycled into exiting on SIGNULL/KILL from substate 53
76 job recycled into exiting on SIGNULL/KILL from substate 57
These messages are logged from line 2199 in
branches/2.4-fixes/src/resmom/requests.c and are triggered when signal 0 or 9
is sent to a task (to check whether it exists, or to kill it, respectively).
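For anyone less familiar with the two signals involved, here is a minimal
shell-level sketch of what they do. It only illustrates the signal semantics
and is not the pbs_mom code path; task_exists and task_kill are made-up helper
names for this example, not anything from TORQUE.

```shell
# Illustration only: hypothetical helpers, not part of TORQUE.

# Signal 0 delivers nothing; kill only reports whether the PID could be
# signalled, so the exit status doubles as an existence check.
task_exists() {
    kill -0 "$1" 2>/dev/null
}

# Signal 9 (SIGKILL) cannot be caught or ignored by the target process.
task_kill() {
    kill -9 "$1" 2>/dev/null
}
```

Note that for a non-root caller, kill -0 also fails on processes owned by
another user, so a failure there does not strictly prove the PID is gone.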
The only workaround we have found so far is to run either momctl -c for the
job on its mother superior or qdel -p for the job. It would be nicer if
pbs_mom could handle this case itself.