[torqueusers] safely restarting pbs_mom without killing jobs

Dave Jackson jacksond at supercluster.org
Mon Nov 1 18:31:51 MST 2004


Garrick,

  There will probably be comments after SuperComputing! :)  Let's look at
this together with the mpiexec developers then.

Dave

On Mon, 2004-11-01 at 18:06, Garrick Staples wrote:
> On Sun, Oct 31, 2004 at 05:14:47PM -0800, Garrick Staples alleged:
> > [subject change for the benefit of future list searches]
> > 
> > On Mon, Nov 01, 2004 at 11:54:52AM +1100, Chris Samuel alleged:
> > > On Mon, 1 Nov 2004 11:47 am, Garrick Staples wrote:
> > > 
> > > > On Mon, Nov 01, 2004 at 10:16:50AM +1100, Chris Samuel alleged:
> > > >
> > > > > On Mon, 1 Nov 2004 10:08 am, Garrick Staples wrote:
> > > > >
> > > > > > I must admit, I haven't tested this with kill -9!
> > > > > >
> > > > > > Recipe for success: never let pbs_mom die gracefully and always start
> > > > > > with -p?
> > > > >
> > > > > That's what *seems* to work here, caveat emptor.
> > > > >
> > > > > Don't blame us if it eats your dog.. :-)
> > > >
> > > > > Unbelievable!  That seems to work perfectly!
> > > 
> > > Phew, I can start breathing again now..  :-)
> > > 
> > > Garrick, thanks for the confirmation that it's not just a fluke here at VPAC!
> > > 
> > > Hopefully this will help other Torquies too..
> > 
> > This is terrific.  I've got initscripts that start pbs_mom with -p on boot
> > (this *will* clear the job if a node reboots during a job), do a normal kill on
> > machine shutdown (again, clear the job if the machine is going down), but
> > always use kill -9 and -p any other time.
> 
> Turns out that this breaks mpiexec jobs if you restart pbs_mom.  I
> knew it was too good to be true =(
> 
> pbs_mom starts scrolling these errors *very* quickly when it comes back up and
> mpiexec is trying to reconnect:
> 
> 11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
> 11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
> 11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
> 11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, non-local connect
> 11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 13273.hpc-master.usc.edu task 1
> 11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 13273.hpc-master.usc.edu task 1
> 
> Any comments from the Torque developers?  Is there any hope for restarting
> pbs_mom?
> 
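
For list searchers: the recipe quoted above (kill -9 the running pbs_mom, then
start it again with -p; plain kill only at machine shutdown) might be sketched
roughly as below. This is an illustration, not anyone's actual initscript. The
lock-file path, the assumption that it holds the MOM's pid, and the function
names are all assumptions to adjust for your site. As reported above, restarting
pbs_mom this way broke running mpiexec jobs, so try it on an idle node first.

```shell
#!/bin/sh
# Sketch of the restart recipe from this thread.
# ASSUMPTIONS (not from the thread): the pid file location and the
# function names; change them to match your installation.
PBS_MOM="${PBS_MOM:-pbs_mom}"
MOM_LOCK="${MOM_LOCK:-/var/spool/torque/mom_priv/mom.lock}"

mom_pid() {
    # Print the recorded pid, or nothing if the file is not readable.
    if [ -r "$MOM_LOCK" ]; then
        cat "$MOM_LOCK"
    fi
}

restart_mom_keep_jobs() {
    # "Any other time": never let pbs_mom die gracefully, then start with -p.
    pid=$(mom_pid)
    if [ -n "$pid" ]; then
        kill -9 "$pid" 2>/dev/null   # SIGKILL, so the MOM cannot clean up its jobs
    fi
    "$PBS_MOM" -p                    # -p as used in the thread to keep running jobs
}

stop_mom_for_shutdown() {
    # Machine shutdown: a normal kill (SIGTERM), so the job is cleared.
    pid=$(mom_pid)
    if [ -n "$pid" ]; then
        kill "$pid"
    fi
}
```

An initscript could call restart_mom_keep_jobs from its "restart" action and
stop_mom_for_shutdown only from the shutdown runlevels.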


