[torquedev] Zombie Torque jobs
enstrom at ncsa.uiuc.edu
Fri Jun 22 09:53:10 MDT 2007
My apologies if this is a duplicate, I sent the first copy before I had
been added to the torquedev mailing list and I think it didn't go out.
We are experiencing some unkillable zombie jobs with Torque.
We have seen several jobs that end up over their walltime; Moab tries to
kill them, but they do not go away unless qdel -p is used.
These are not jobs where the mother superior pbs_mom is
unreachable. Instead with these jobs I see a message "Premature end of
message from addr <ipaddress>" in the mother superior's log. The node
whose IP address is given is one of the sister nodes for the job. All of
the other sister nodes have "JOIN JOB as node <x>" in their logs. The
problem node has no entries for the job in its log. Running momctl on the
mother superior (and on the rest of the sister nodes except the bad one)
shows the job on the node in state=PRERUN. On the bad node momctl reports
no local job.
It looks like a momentary communication glitch at job start is causing the
problem. I can reproduce it by running the pbs_mom under gdb on one of the
sister nodes. I set a breakpoint at mom_comm.c:1854, where the mom handles
a JOIN_JOB request. When the mother superior sends its JOIN_JOB request to
the sister, I leave the mom sitting at the breakpoint. The job then behaves
the same as if a communication failure had happened.
If the sister node is down, Torque correctly notices the start failure and
requeues the job. If the mom is running but does nothing with the JOIN_JOB
message, the job ends up a zombie. In resmom/start_exec.c there is a
comment where the JOIN_JOB requests are sent to the sister nodes:
/* NOTE: does not check success of join request */
I did try setting mom_job_sync to True in qmgr. With this set, I was still
able to get the job stuck in the PRERUN state by simulating an unresponsive
sister mom.
The biggest issue is that the job is unkillable. If that were fixed, the
job would at least exit when it hit its wallclock limit. Better still, the
job would realize it was in trouble and requeue itself in the first place,
perhaps by timing out after sitting in the PRERUN state too long.
Has anyone looked into this already?