[torquedev] [Bug 218] Jobs getting stuck in exiting "job recycled into exiting on SIGNULL/KILL"

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Thu Oct 11 12:11:15 MDT 2012


Michael Jennings <mej at lbl.gov> changed:

           What    |Removed                     |Added
                 CC|                            |mej at lbl.gov

--- Comment #2 from Michael Jennings <mej at lbl.gov> 2012-10-11 12:11:15 MDT ---
Based on my reading of the code, the key factor here isn't just that signal 0
or 9 is being sent to the job, but specifically that no processes received it.
The job states you mention below vary widely (from RUNNING to PREOBIT to
EXITING and others), so I'm not sure the job state is really significant.

I think the key point, though, is that the job's processes have already
vanished by the time the signal is sent.
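
For anyone who wants to see the underlying behavior, here is a minimal sketch
of the POSIX semantics involved. It is not TORQUE's code (pbs_mom is written in
C), and the function name live_pids is made up for illustration. Signal 0
delivers nothing; it only asks the kernel whether the target exists, and the
call fails with ESRCH once the process is gone. The sketch assumes a POSIX
system.

```python
import os


def live_pids(pids):
    """Return the subset of `pids` that still exist, probed with signal 0.

    Hypothetical helper for illustration only; not part of TORQUE.
    """
    alive = []
    for pid in pids:
        try:
            # Signal 0 performs the existence and permission checks
            # without actually delivering a signal.
            os.kill(pid, 0)
        except ProcessLookupError:
            # ESRCH: no such process, i.e. it has already vanished.
            continue
        except PermissionError:
            # EPERM: the process exists but belongs to someone else.
            alive.append(pid)
            continue
        alive.append(pid)
    return alive
```

If a probe like this returns an empty list for every PID recorded for a job,
that is the "nobody received the signal" situation described above: the
processes are gone regardless of which state the job is recorded in.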

I can confirm that we were seeing this message on one of our RHEL5-based
clusters while it was running 2.5.x. Since upgrading it to 4.1.1 over a month
ago, we haven't seen the message at all.
