Bugzilla – Bug 218
Jobs getting stuck in exiting "job recycled into exiting on SIGNULL/KILL"
Last modified: 2012-12-11 09:50:55 MST
We are running 2.4, but I note that this code is essentially unchanged through to 4.1.x.

On one of our clusters we are seeing bursts of jobs that end up stuck in an EXITING state on various nodes for no apparent reason. For this month I see:

[root@merri-m ~]# xdsh compute -v 'fgrep -h "job recycled into exiting" /var/spool/torque/mom_logs/201209*' | awk -F\; '{print $NF}' | sort | uniq -c
     10 job recycled into exiting on SIGNULL/KILL from substate 1
      9 job recycled into exiting on SIGNULL/KILL from substate 40
     45 job recycled into exiting on SIGNULL/KILL from substate 42
     50 job recycled into exiting on SIGNULL/KILL from substate 50
      6 job recycled into exiting on SIGNULL/KILL from substate 53
     76 job recycled into exiting on SIGNULL/KILL from substate 57

These messages are logged from line 2199 in branches/2.4-fixes/src/resmom/requests.c and are triggered when signal 0 or 9 is sent to a task (to see whether it exists, or to kill it off, respectively).

The only workaround we have found so far is to run either momctl -c for the job on its mother superior, or qdel -p for the job. It would be nicer if the code could just handle this case.
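For anyone wanting to run the same tally on a copy of the mom logs without xdsh and awk, here is a minimal Python sketch of that pipeline. It assumes the message is the last ';'-separated field of each log line, which is what the awk command relies on. The function name and the sample lines (including the job ID 1.example) are made up for illustration and are not taken from a real log.

```python
from collections import Counter

MARKER = "job recycled into exiting"


def tally_recycled(lines):
    """Count each distinct trailing message among lines that contain MARKER."""
    counts = Counter()
    for line in lines:
        if MARKER in line:
            # Equivalent of awk -F';' '{print $NF}': keep only the last field.
            counts[line.rstrip("\n").split(";")[-1]] += 1
    return counts


# Hypothetical sample lines, shaped like the pbs_mom log format seen in this bug.
sample = [
    "09/01/2012 10:00:00;0008;   pbs_mom;Job;1.example;job recycled into exiting on SIGNULL/KILL from substate 57",
    "09/01/2012 10:00:05;0008;   pbs_mom;Job;1.example;job recycled into exiting on SIGNULL/KILL from substate 57",
    "09/01/2012 10:01:00;0008;   pbs_mom;Job;2.example;job recycled into exiting on SIGNULL/KILL from substate 42",
    "09/01/2012 10:02:00;0008;   pbs_mom;Job;3.example;JOIN JOB as node 1",
]

for message, count in sorted(tally_recycled(sample).items()):
    print(f"{count:7d} {message}")
```

To use it on real files, pass an open file object (or any iterable of lines) to tally_recycled instead of the sample list.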
I can confirm this is still happening with the latest RHEL 5.8 updates. We did a full reinstall of the affected cluster last week but it's still happening:

[root@merri-m ~]# xdsh compute -v 'fgrep -h "job recycled into exiting" /var/spool/torque/mom_logs/201210*' | awk -F\; '{print $NF}' | sort | uniq -c
      2 job recycled into exiting on SIGNULL/KILL from substate 42
      1 job recycled into exiting on SIGNULL/KILL from substate 50
     19 job recycled into exiting on SIGNULL/KILL from substate 57

Any ideas please? It's driving us (and our users) nuts.
Based on my reading of the code, the key factor here isn't just that signal 0 or 9 is being sent to the job, but specifically that no processes received it; that is, the job's processes have already vanished. The job substates you mention vary widely (from RUNNING to PREOBIT to EXITING and others), so I'm not sure the substate is really significant.

I can confirm that we were seeing it on one of our RHEL5-based clusters when it was running 2.5.x, but since recently upgrading it to 4.1.1 we haven't seen that message at all (i.e., in over a month).
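To illustrate the "no processes received it" condition: on POSIX systems, sending signal 0 delivers nothing and only checks whether the target exists, failing with ESRCH once the process is gone. The sketch below shows that behaviour in Python. It is not Torque's code; count_alive and demo are hypothetical names, and it assumes a POSIX host.

```python
import os
import subprocess
import sys


def count_alive(pids):
    """Return how many of the given PIDs still refer to an existing process."""
    alive = 0
    for pid in pids:
        try:
            os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
        except ProcessLookupError:
            continue  # ESRCH: no such process, it has already vanished
        except PermissionError:
            pass  # EPERM: the process exists but belongs to another user
        alive += 1
    return alive


def demo():
    """Probe our own PID and the PID of a child that has already exited."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # reap the child so its PID no longer names a process
    return count_alive([os.getpid()]), count_alive([child.pid])


if __name__ == "__main__":
    print(demo())
```

A result of zero from a probe like this corresponds to the situation described above, where the signal reaches no process at all.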
Interesting, though something else I've noticed is an error that seems to happen beforehand, when the job launches, saying (for example):

11/09/2012 13:36:00;0008; pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;JOIN JOB as node 1
11/09/2012 13:36:00;0008; pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;start_process: task started, tid 2, sid 27290, cmd orted
11/09/2012 13:36:01;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0) in req_quejob, cannot queue new job, job exists and is running
11/09/2012 13:36:01;0080; pbs_mom;Req;req_reject;Reject reply code=15009(Job with requested ID already exists MSG=job is running), aux=0, type=QueueJob, from PBS_Server@merri-m.pcf.vlsci.unimelb.edu.au

I'm wondering if this spurious second start message is confusing the pbs_mom.
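To check how often this pattern shows up, one could scan the mom logs for a req_quejob error arriving shortly after a JOIN JOB. The Python sketch below does that by time correlation only, since the error line itself does not name the job. The field layout is inferred from the lines quoted above, and the function names and the 5-second window are assumptions rather than anything defined by Torque.

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Split a pbs_mom log line into (timestamp, object, message); None if malformed."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, parts[4].strip(), parts[5]


def suspect_joins(lines, window=timedelta(seconds=5)):
    """List job IDs whose JOIN JOB is followed within `window` by a req_quejob error."""
    joins = []     # (timestamp, job id) for every JOIN JOB seen so far
    suspects = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, obj, message = parsed
        if message.startswith("JOIN JOB"):
            joins.append((stamp, obj))
        elif "cannot queue new job" in message:
            for joined, job in joins:
                if timedelta(0) <= stamp - joined <= window and job not in suspects:
                    suspects.append(job)
    return suspects


# The four log lines quoted in the comment above.
sample = [
    "11/09/2012 13:36:00;0008; pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;JOIN JOB as node 1",
    "11/09/2012 13:36:00;0008; pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;start_process: task started, tid 2, sid 27290, cmd orted",
    "11/09/2012 13:36:01;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0) in req_quejob, cannot queue new job, job exists and is running",
    "11/09/2012 13:36:01;0080; pbs_mom;Req;req_reject;Reject reply code=15009(Job with requested ID already exists MSG=job is running), aux=0, type=QueueJob, from PBS_Server@merri-m.pcf.vlsci.unimelb.edu.au",
]

print(suspect_joins(sample))
```

On a busy node several jobs may join within the same window, so a match here is only a hint that a job is worth a closer look.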
This is fixed in 4.1.4, but there currently are no plans to backport this to the 2.4 series of code.
(In reply to comment #4)
> This is fixed in 4.1.4, but there currently are no plans to backport this to
> the 2.4 series of code.

Hi David,

Any chance of a commit ID for that so I can look at what's involved please?

thanks!
Chris
(In reply to comment #5)
> Any chance of a commit ID for that so I can look at what's involved please?

Of course. In 4.1-fixes (svn commit) it's revision 7175. In 4.1-dev on git it's commit 35d74cb144fdddd2dfd0be272c5f1715ae2d6715

David