[torqueusers] RE: Zombie Torque jobs

Steve Angelovich sangelovich at lgc.com
Wed Oct 10 16:08:28 MDT 2007


Sorry if this is a repost, but I don't think the original message made
it.


On Mon, 2007-10-08 at 21:22 -0600, Steve Angelovich wrote:
> Does anybody know if there is a solution to the problem that Peter
> described below?  I think we are running into the same issue or
> something similar.
> 
> I've inserted the relevant sections from the mom logs on the mother
> superior node and one of the sister nodes.  
> 
> Thanks,
> Steve
> 
> 
> 10/08/2007 11:06:55;0002;   pbs_mom;Svr;im_eof;Premature end of message
> from addr 10.0.1.42:15003
> 10/08/2007 11:06:55;0001;   pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 62039.h1, job_start_error from node q2 in
> job_start_error
> 10/08/2007 11:06:55;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 10/08/2007 11:06:55;0001;   pbs_mom;Svr;pbs_mom;job_start_error,
> job_start_error: sent 2 ABORT requests, should be 3
> 10/08/2007 11:06:55;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 10/08/2007 11:06:55;0001;   pbs_mom;Svr;pbs_mom;node_bailout,
> node_bailout: received KILL/ABORT request for job 62039.h1 from node q2
> 10/08/2007 11:06:55;0001;   pbs_mom;Svr;pbs_mom;node_bailout,
> node_bailout: received KILL/ABORT request for job 62039.h1 from node q2
> 10/08/2007 11:06:55;0008;   pbs_mom;Job;62039.h1;ERROR:    received
> request 'ERROR' from 10.0.1.42:15003 for job '62039.h1' (job does not
> exist locally)
> 10/08/2007 11:06:55;0008;   pbs_mom;Job;62039.h1;ERROR:    received
> request 'ERROR' from 10.0.1.42:15003 for job '62039.h1' (job does not
> exist locally)
> 10/08/2007 11:06:57;0008;   pbs_mom;Job;62039.h1;JOIN JOB as node 2
> 
> 10/08/2007 11:03:47;0001;   pbs_mom;Svr;pbs_mom;sister could not
> communicate (15059) in 62039.h1, job_start_error from node e5 in
> job_start_error
> 10/08/2007 11:03:47;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 10/08/2007 11:03:47;0008;   pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 10/08/2007 11:03:47;0001;   pbs_mom;Svr;pbs_mom;node_bailout,
> node_bailout: received KILL/ABORT request for job 62039.h1 from node e5
> 10/08/2007 11:03:47;0001;   pbs_mom;Svr;pbs_mom;node_bailout,
> node_bailout: received KILL/ABORT request for job 62039.h1 from node e5
> 10/08/2007 11:03:47;0001;   pbs_mom;Svr;pbs_mom;im_request, event 15
> taskid 0 not found
> 10/08/2007 11:03:47;0001;   pbs_mom;Svr;pbs_mom;im_request, job
> 62039.h1: command 99
> 10/08/2007 11:03:47;0002;   pbs_mom;Svr;im_eof;No error from addr
> 10.0.1.5:15003
> 10/08/2007 11:03:48;0008;   pbs_mom;Job;62039.h1;JOIN JOB as node 1
> 10/08/2007 11:03:51;0008;   pbs_mom;Job;62102.h1;JOIN JOB as node 2
> 10/08/2007 11:04:40;0008;   pbs_mom;Job;62116.h1;JOIN JOB as node 1
> 10/08/2007 11:06:55;0008;   pbs_mom;Job;62039.h1;ERROR:    received
> request 'ABORT_JOB' from 10.0.1.22:1021 for job '62039.h1' ( job does
> not exist locally)
> 
> 
> 
> > We are experiencing some unkillable zombie jobs with Torque.
> > 
> > We have seen several jobs that end up being over their walltime on a couple
> > of different systems.
> > Moab is trying to kill them, but they don't go away unless qdel -p is used.
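
(For the archives: the purge form referred to above is qdel's -p flag; the job
id below is just one of the stuck jobs from the logs earlier in this message.)

    # force pbs_server to discard the job record instead of waiting on the MOMs
    qdel -p 62039.h1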
> > 
> > These are not jobs where the mother superior pbs_mom is 
> > unreachable.  Instead, with these jobs I see the message "Premature end of
> > message from addr <ipaddress>" in the mother superior's log.  The node
> > whose IP address is given is one of the sister nodes for the job.  All of
> > the other sister nodes have "JOIN JOB as node <x>" in their logs.  The
> > problem node has no entries for the job in its log.  Running momctl on the
> > mother superior (and on the sister nodes other than the bad one) shows the
> > job in state=PRERUN.  On the bad node it shows "no local jobs detected".
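
(The momctl check described above can be run against each node by hostname,
roughly as follows; the diagnose level and the node names -- the latter taken
from the logs earlier in this message -- are only examples.)

    # ask each MOM which jobs it thinks it is running
    momctl -h q2 -d 3
    momctl -h e5 -d 3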
> > 
> > It looks like a momentary communication glitch on job start is causing the
> > problem.  I can reproduce it by running the pbs_mom under gdb on one of the
> > sister nodes.  I put a breakpoint at mom_comm.c:1854, where the mom handles
> > a JOIN_JOB request.  When the mother superior sends its JOIN_JOB request to
> > the sister, I leave the mom sitting at the breakpoint.  The job then behaves
> > the same as if the communication failure had happened.
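
(For anyone wanting to reproduce this, the gdb session described above amounts
to roughly the following on a sister node; the line number is the one quoted
above and will differ between Torque versions.)

    # attach to the sister's pbs_mom and hold it at the JOIN_JOB handler so it
    # never answers the mother superior
    gdb -p $(pidof pbs_mom)
    (gdb) break mom_comm.c:1854
    (gdb) continue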
> > 
> > If the sister node is down, Torque correctly notices the start failure and
> > requeues the job.  If the mom is running but does nothing with the JOIN_JOB
> > message, the job ends up a zombie.  In resmom/start_exec.c there is a
> > comment where the JOIN_JOB requests are sent to the sister nodes:
> >      /* NOTE:  does not check success of join request */
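
(Just to make the suggestion concrete, the kind of guard being hinted at might
look roughly like the sketch below.  Every name in it is a hypothetical
stand-in for illustration only, not the actual TORQUE code.)

    /* Hypothetical sketch only -- not the real resmom/start_exec.c.  The
     * idea: remember when JOIN_JOB went out and, if a sister never answers
     * within some window, abort and requeue instead of silently assuming
     * the join succeeded. */
    #include <stdio.h>
    #include <time.h>

    #define JOIN_TIMEOUT 120          /* seconds; illustrative value */

    struct join_state
      {
      time_t sent_at;                 /* when JOIN_JOB was sent to the sisters */
      int    acks;                    /* sister acknowledgements received      */
      int    expected;                /* number of sister nodes                */
      };

    /* stand-in for whatever would really abort and requeue the job */
    static void requeue_job(const char *jobid)
      {
      fprintf(stderr, "join timed out for %s, requeueing\n", jobid);
      }

    /* would be polled from the MOM main loop (hypothetical hook) */
    static void check_join_timeout(struct join_state *js, const char *jobid)
      {
      if ((js->acks < js->expected) &&
          (time(NULL) - js->sent_at > JOIN_TIMEOUT))
        {
        requeue_job(jobid);
        }
      }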
> > 
> > I did try setting mom_job_sync to True in qmgr.  With this set, I was still
> > able to get the job stuck in the PRERUN state by simulating an unresponsive
> > sister node.
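
(For reference, that parameter is a pbs_server attribute, so it was set with
something along these lines.)

    # ask the server to synchronize/clean up stale job state on the MOMs
    qmgr -c "set server mom_job_sync = True"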
> > 
> > The biggest issue is that the job is unkillable.  If this were fixed, at
> > least the job would exit when it hit its wallclock limit.  It would be
> > better if the job realized it was in trouble and requeued itself in the
> > first place, perhaps by timing out after sitting in the PRERUN state too
> > long.
> > 
> > Has anyone looked into this already?
> > 
> > Thanks,
> >    Peter Enstrom
> >    NCSA
