[torquedev] Zombie Torque jobs

Peter Enstrom enstrom at ncsa.uiuc.edu
Fri Jun 22 09:53:10 MDT 2007

My apologies if this is a duplicate; I sent the first copy before I had
been added to the torquedev mailing list, and I think it didn't go out.

We are experiencing some unkillable zombie jobs with Torque.

We have seen several jobs run past their walltime on a couple of
different systems.  Moab tries to kill them, but they don't go away
unless qdel -p is used.

These are not jobs where the mother superior pbs_mom is 
unreachable.  Instead with these jobs I see a message "Premature end of 
message from addr <ipaddress>"  in the mother superior's log. The node 
whose IP address is given is one of the sister nodes for the job.  All of 
the other sister nodes have "JOIN JOB as node <x>" in their logs.  The 
problem node has no entries for the job in its log.  Running momctl on the
mother superior (and on the rest of the sister nodes except the bad one)
shows the job on the node in state=PRERUN.  On the bad node it shows
"no local jobs detected".

It looks like a momentary communication glitch on job start is causing the 
problem. I can recreate this problem by running the pbs_mom under gdb on 
one of the sister nodes.  I put a breakpoint at mom_comm.c:1854, where the
mom handles a JOIN_JOB request.  When the mother superior sends its
JOIN_JOB request to the sister, I leave it sitting at the breakpoint.  The
job then behaves the same as if the communication failure had happened.

If the sister node is down, Torque correctly notices the start failure and
requeues the job.  If the mom is running but does nothing with the JOIN_JOB
message, the job ends up a zombie.  In resmom/start_exec.c there is a
comment at the point where the JOIN_JOB requests are sent to the sister nodes:
     /* NOTE:  does not check success of join request */
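
As a rough illustration of what checking the join result could involve,
here is a minimal sketch in C.  All of the names (sister_join_t,
join_record_reply, join_all_acked) are hypothetical and are not taken from
the Torque source; the sketch only shows the bookkeeping of recording which
sisters have acknowledged JOIN_JOB.

```c
#include <stddef.h>

/* Hypothetical per-sister bookkeeping; not Torque's real structures. */
typedef enum { JOIN_PENDING = 0, JOIN_ACKED, JOIN_FAILED } join_state_t;

typedef struct {
    int          node_id;   /* index of the sister within the job */
    join_state_t state;     /* result of the JOIN_JOB request so far */
} sister_join_t;

/* Record a reply from one sister.
 * Returns 0 on success, -1 if node_id is not one of the job's sisters. */
int join_record_reply(sister_join_t *sisters, size_t n, int node_id, int ok)
{
    for (size_t i = 0; i < n; i++) {
        if (sisters[i].node_id == node_id) {
            sisters[i].state = ok ? JOIN_ACKED : JOIN_FAILED;
            return 0;
        }
    }
    return -1;
}

/* Returns 1 only when every sister has acknowledged the join. */
int join_all_acked(const sister_join_t *sisters, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (sisters[i].state != JOIN_ACKED)
            return 0;
    }
    return 1;
}
```

With something like this, the mother superior could hold the job back
until join_all_acked() returns 1, instead of assuming that every
JOIN_JOB request succeeded.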

I did try setting mom_job_sync to True in qmgr. With this set I was still 
able to get the job stuck in the PRERUN state by simulating an unresponsive 
sister node.

The biggest issue is that the job is unkillable.  If this were fixed, at
least the job would exit when it hit its wallclock limit.  It would be
better if the job realized it was in trouble and requeued itself in the
first place, perhaps by timing out when it has been in the PRERUN state
too long.
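
A minimal sketch of the PRERUN timeout idea, in C.  The types and the
function name (job_view_t, prerun_timed_out) and the limit parameter are
assumptions made for illustration only; they do not correspond to anything
in the Torque source.

```c
#include <time.h>

/* Hypothetical job substates; only PRERUN matters for this check. */
typedef enum { SUBSTATE_PRERUN, SUBSTATE_RUNNING } substate_t;

typedef struct {
    substate_t substate;       /* current substate of the job */
    time_t     prerun_since;   /* when the job entered PRERUN */
} job_view_t;

/* Returns 1 if the job has been in PRERUN for more than limit_sec
 * seconds, 0 otherwise (including when it is not in PRERUN at all). */
int prerun_timed_out(const job_view_t *job, time_t now, double limit_sec)
{
    if (job->substate != SUBSTATE_PRERUN)
        return 0;
    return difftime(now, job->prerun_since) > limit_sec;
}
```

The mother superior could run a check like this periodically and requeue
or fail any job for which it returns 1.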

Has anyone looked into this already?

   Peter Enstrom
