Bug 218 - Jobs getting stuck in exiting "job recycled into exiting on SIGNULL/KILL"
Status: NEW
Product: TORQUE
Component: pbs_mom
Version: 2.4.x
Platform: PC Linux
Importance: P5 normal
Assigned To: Ken Nielson
Reported: 2012-09-30 23:08 MDT by Chris Samuel
Modified: 2012-12-11 09:50 MST (History)

Description Chris Samuel 2012-09-30 23:08:46 MDT
We are running 2.4, but I note that this code is essentially unchanged through
to 4.1.x.

On one of our clusters we are seeing bursts of jobs that end up stuck in an
EXITING state on various nodes for no apparent reason.   For this month I see:

[root@merri-m ~]# xdsh compute -v 'fgrep -h "job recycled into exiting" /var/spool/torque/mom_logs/201209*' | awk -F\; '{print $NF}' | sort | uniq -c
     10 job recycled into exiting on SIGNULL/KILL from substate 1
      9 job recycled into exiting on SIGNULL/KILL from substate 40
     45 job recycled into exiting on SIGNULL/KILL from substate 42
     50 job recycled into exiting on SIGNULL/KILL from substate 50
      6 job recycled into exiting on SIGNULL/KILL from substate 53
     76 job recycled into exiting on SIGNULL/KILL from substate 57
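
For anyone wanting to reproduce this tally on a single node without xdsh, the pipeline above can be approximated in Python. This is only a minimal sketch, not anything from TORQUE: the function name `tally_messages` and the sample records (timestamps and job IDs included) are made up for illustration, and it assumes each log record sits on one line with the message as the last `;`-separated field.

```python
from collections import Counter

MARKER = "job recycled into exiting"


def tally_messages(lines):
    """Count the last ';'-separated field of every line containing MARKER.

    Mirrors: fgrep -h MARKER ... | awk -F\\; '{print $NF}' | sort | uniq -c
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if MARKER in line:
            counts[line.split(";")[-1]] += 1
    return counts


# Hypothetical records shaped like pbs_mom log lines; the job IDs are invented.
sample = [
    "09/30/2012 10:00:00;0008;   pbs_mom;Job;1.example;job recycled into exiting on SIGNULL/KILL from substate 57",
    "09/30/2012 10:00:05;0008;   pbs_mom;Job;2.example;job recycled into exiting on SIGNULL/KILL from substate 57",
    "09/30/2012 10:00:09;0008;   pbs_mom;Job;3.example;job recycled into exiting on SIGNULL/KILL from substate 42",
    "09/30/2012 10:00:10;0008;   pbs_mom;Job;4.example;JOIN JOB as node 1",
]

if __name__ == "__main__":
    for message, count in sorted(tally_messages(sample).items()):
        print(f"{count:7d} {message}")
```

To run it against a real log, pass an open file object instead of `sample`.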

These messages are logged from line 2199 in
branches/2.4-fixes/src/resmom/requests.c and are triggered when signal 0 or 9
is sent to a task (to check whether it exists, or to kill it off,
respectively).
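
For reference, the signal 0 check works like this at the POSIX level: kill() with signal 0 delivers nothing and only reports whether the target could be signalled. The sketch below illustrates that general behaviour in Python; it is not the code in requests.c, the helper name `process_exists` is hypothetical, and it assumes a Unix-like system (signal 0 means something else on Windows).

```python
import os
import subprocess
import sys


def process_exists(pid):
    """Return True if a process with this PID exists, probing with signal 0."""
    try:
        os.kill(pid, 0)  # signal 0: permission and existence check only
    except ProcessLookupError:
        return False  # ESRCH: no such process to receive the signal
    except PermissionError:
        return True  # EPERM: the process exists but belongs to someone else
    return True


if __name__ == "__main__":
    print("this process:", process_exists(os.getpid()))

    # Start a short-lived child and reap it; afterwards its PID should be gone
    # (barring the unlikely case that the PID has already been reused).
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    print("reaped child:", process_exists(child.pid))
```

The `ProcessLookupError` branch is the case of interest here: the signal was sent but there was no process left to receive it.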

So far the only workaround we've found is to run either momctl -c for
the job on its mother superior or qdel -p for the job.   It would be nicer if
pbs_mom could handle this case itself.
Comment 1 Chris Samuel 2012-10-10 21:36:36 MDT
I can confirm this is still happening with latest RHEL 5.8 updates. We did a
full reinstall of the affected cluster last week but it's still happening:

[root@merri-m ~]# xdsh compute -v 'fgrep -h "job recycled into exiting" /var/spool/torque/mom_logs/201210*' | awk -F\; '{print $NF}' | sort | uniq -c
      2 job recycled into exiting on SIGNULL/KILL from substate 42
      1 job recycled into exiting on SIGNULL/KILL from substate 50
     19 job recycled into exiting on SIGNULL/KILL from substate 57

Any ideas please?    It's driving us (and our users) nuts..
Comment 2 Michael Jennings 2012-10-11 12:11:15 MDT
Based on my reading of the code, the key factor here isn't just that signal 0
or 9 is being sent to the job, but specifically that there are no processes
which received it.  The job states you mention below vary widely (from RUNNING
to PREOBIT to EXITING and others), so I'm not sure that's really significant.

I think the key point is that the processes have vanished, though.

I can confirm that we were seeing it on one of our RHEL5-based clusters when it
was running 2.5.x, but after recently upgrading it to 4.1.1, we haven't seen
that message at all since then (i.e., in over a month).
Comment 3 Chris Samuel 2012-11-08 21:14:29 MST
Interesting, though something else I've noticed is an error that seems to
happen beforehand, when the job launches, saying (for example):

11/09/2012 13:36:00;0008;   pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;JOIN JOB as node 1
11/09/2012 13:36:00;0008;   pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;start_process: task started, tid 2, sid 27290, cmd orted
11/09/2012 13:36:01;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0) in req_quejob, cannot queue new job, job exists and is running
11/09/2012 13:36:01;0080;   pbs_mom;Req;req_reject;Reject reply code=15009(Job with requested ID already exists MSG=job is running), aux=0, type=QueueJob, from PBS_Server@merri-m.pcf.vlsci.unimelb.edu.au

I'm wondering if this spurious second start message is confusing the pbs_mom..
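
One way to check whether that rejection really precedes the stuck jobs would be to scan a mom log for JOIN JOB records followed shortly by a "cannot queue new job" error. The sketch below is only a rough heuristic under stated assumptions: the error record does not carry a job ID, so it pairs records purely by timestamp proximity; the 5-second window, the six-field `;`-separated layout, and the names `parse_record` and `suspect_jobs` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def parse_record(line):
    """Split 'timestamp;code;daemon;class;object;message' into a dict."""
    stamp, code, daemon, kind, obj, message = line.rstrip("\n").split(";", 5)
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "code": code,
        "daemon": daemon.strip(),
        "kind": kind,
        "object": obj,
        "message": message,
    }


def suspect_jobs(lines, window=timedelta(seconds=5)):
    """Return IDs of jobs whose JOIN JOB is followed within `window`
    by a 'cannot queue new job' error (matched by time only)."""
    # Skip anything that does not have at least six ';'-separated fields.
    records = [parse_record(l) for l in lines if l.count(";") >= 5]
    rejects = [r["time"] for r in records
               if "cannot queue new job" in r["message"]]
    suspects = []
    for r in records:
        if r["kind"] == "Job" and r["message"].startswith("JOIN JOB"):
            if any(timedelta(0) <= t - r["time"] <= window for t in rejects):
                suspects.append(r["object"])
    return suspects


# Records taken from the log excerpt in this comment, one per line.
sample = [
    "11/09/2012 13:36:00;0008;   pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;JOIN JOB as node 1",
    "11/09/2012 13:36:00;0008;   pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;start_process: task started, tid 2, sid 27290, cmd orted",
    "11/09/2012 13:36:01;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0) in req_quejob, cannot queue new job, job exists and is running",
]

if __name__ == "__main__":
    print(suspect_jobs(sample))
```

The resulting job IDs could then be compared against the jobs that later log "job recycled into exiting" on the same node.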
Comment 4 David Beer 2012-12-10 17:30:31 MST
This is fixed in 4.1.4, but there currently are no plans to backport this to
the 2.4 series of code.
Comment 5 Chris Samuel 2012-12-10 21:55:36 MST
(In reply to comment #4)
> This is fixed in 4.1.4, but there currently are no plans to backport this to
> the 2.4 series of code.

Hi David,

Any chance of a commit ID for that so I can look at what's involved please?

thanks!
Chris
Comment 6 David Beer 2012-12-11 09:50:55 MST
(In reply to comment #5)
> Any chance of a commit ID for that so I can look at what's involved please?

Of course. In 4.1-fixes (svn commit) it's revision 7175. In 4.1-dev on git it's
commit 35d74cb144fdddd2dfd0be272c5f1715ae2d6715

David