[torqueusers] safely restarting pbs_mom without killing jobs
Dave Jackson
jacksond at supercluster.org
Mon Nov 1 18:31:51 MST 2004
Garrick,
There will probably be comments after SuperComputing! :) Let's look at
this together with the mpiexec developers then.
Dave
On Mon, 2004-11-01 at 18:06, Garrick Staples wrote:
> On Sun, Oct 31, 2004 at 05:14:47PM -0800, Garrick Staples alleged:
> > [subject change for the benefit of future list searches]
> >
> > On Mon, Nov 01, 2004 at 11:54:52AM +1100, Chris Samuel alleged:
> > > On Mon, 1 Nov 2004 11:47 am, Garrick Staples wrote:
> > >
> > > > On Mon, Nov 01, 2004 at 10:16:50AM +1100, Chris Samuel alleged:
> > > >
> > > > > On Mon, 1 Nov 2004 10:08 am, Garrick Staples wrote:
> > > > >
> > > > > > I must admit, I haven't tested this with kill -9!
> > > > > >
> > > > > > Recipe for success: never let pbs_mom die gracefully and always start
> > > > > > with -p?
> > > > >
> > > > > That's what *seems* to work here, caveat emptor.
> > > > >
> > > > > Don't blame us if it eats your dog.. :-)
> > > >
> > > > > Unbelievable! That seems to work perfectly!
> > >
> > > Phew, I can start breathing again now.. :-)
> > >
> > > Garrick, thanks for the confirmation that it's not just a fluke here at VPAC!
> > >
> > > Hopefully this will help other Torquies too..
> >
> > This is terrific. I've got initscripts that start pbs_mom with -p on boot
> > (this *will* clear the job if a node reboots during a job), do a normal kill on
> > machine shutdown (again, clear the job if the machine is going down), but
> > always use kill -9 and -p any other time.
>
> Turns out that this breaks mpiexec jobs if you restart pbs_mom. I
> knew it was too good to be true =(
>
> pbs_mom starts scrolling these errors *very* quickly when it comes back up and
> mpiexec is trying to reconnect:
>
> 11/01/2004 16:55:17;0001; pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
> 11/01/2004 16:55:17;0001; pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
> 11/01/2004 16:55:17;0001; pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
> 11/01/2004 16:55:17;0001; pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, non-local connect
> 11/01/2004 16:55:17;0001; pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 13273.hpc-master.usc.edu task 1
> 11/01/2004 16:55:17;0001; pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 13273.hpc-master.usc.edu task 1
>
> Any comments from the Torque developers? Is there any hope for restarting
> pbs_mom?
>
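
For list searchers, the recipe quoted above (start pbs_mom with -p on boot, do a normal kill on shutdown, and use kill -9 followed by -p any other time) can be sketched roughly as below. This is only an illustration: mom_plan and its event names are hypothetical and not part of TORQUE, "normal kill" is assumed to mean the default SIGTERM, and pbs_mom is assumed to be on the PATH. As reported in the thread, the restart case still breaks running mpiexec jobs.

```python
import signal

# Hypothetical helper, not part of TORQUE. It only describes the steps
# from the thread; it does not run or signal anything itself.
PBS_MOM = "pbs_mom"  # assumption: pbs_mom is on the PATH


def mom_plan(event):
    """Return the ordered steps for an event.

    Each step is ("signal", signal_number) or ("start", argv).
    """
    if event == "boot":
        # Start with -p on boot, as described in the thread.
        return [("start", [PBS_MOM, "-p"])]
    if event == "shutdown":
        # Normal kill on machine shutdown (assumed to be SIGTERM).
        return [("signal", signal.SIGTERM)]
    if event == "restart":
        # Any other time: kill -9, then start again with -p.
        return [("signal", signal.SIGKILL), ("start", [PBS_MOM, "-p"])]
    raise ValueError(f"unknown event: {event!r}")
```

An initscript wrapper could walk the returned steps, sending each signal to the running pbs_mom and then launching the listed command line.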