[torqueusers] safely restarting pbs_mom without killing jobs

Garrick Staples garrick at usc.edu
Mon Nov 1 18:06:23 MST 2004


On Sun, Oct 31, 2004 at 05:14:47PM -0800, Garrick Staples alleged:
> [subject change for the benefit of future list searches]
> 
> On Mon, Nov 01, 2004 at 11:54:52AM +1100, Chris Samuel alleged:
> > On Mon, 1 Nov 2004 11:47 am, Garrick Staples wrote:
> > 
> > > On Mon, Nov 01, 2004 at 10:16:50AM +1100, Chris Samuel alleged:
> > >
> > > > On Mon, 1 Nov 2004 10:08 am, Garrick Staples wrote:
> > > >
> > > > > I must admit, I haven't tested this with kill -9!
> > > > >
> > > > > Recipe for success: never let pbs_mom die gracefully and always start
> > > > > with -p?
> > > >
> > > > That's what *seems* to work here, caveat emptor.
> > > >
> > > > Don't blame us if it eats your dog.. :-)
> > >
> > > Unbelievable!  That seems to work perfectly!
> > 
> > Phew, I can start breathing again now..  :-)
> > 
> > Garrick, thanks for the confirmation that it's not just a fluke here at VPAC!
> > 
> > Hopefully this will help other Torquies too..
> 
> This is terrific.  I've got initscripts that start pbs_mom with -p on boot
> (this *will* clear the job if a node reboots during a job), do a normal kill on
> machine shutdown (again, clear the job if the machine is going down), but
> always use kill -9 and -p any other time.
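
For list searchers, the policy quoted above can be sketched as a dry-run
shell function. This is only an illustration of the recipe as described
in this thread, not a tested initscript: mom_plan and MOM_PID are
hypothetical names, and the function prints the commands instead of
running them. Note that this message reports the recipe breaks mpiexec
jobs.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for each event rather than running them.
# MOM_PID is a placeholder (assumption); a real script would have to look up
# the pbs_mom PID itself.
mom_plan() {
    event=$1
    pid=${MOM_PID:-PID}
    case "$event" in
        boot)
            # On boot, start pbs_mom with -p, as in the recipe above.
            echo "pbs_mom -p"
            ;;
        shutdown)
            # On machine shutdown, a normal kill (default SIGTERM).
            echo "kill $pid"
            ;;
        restart)
            # Any other time: kill -9, then start again with -p.
            echo "kill -9 $pid"
            echo "pbs_mom -p"
            ;;
        *)
            echo "usage: mom_plan boot|shutdown|restart" >&2
            return 1
            ;;
    esac
}
```

For example, "MOM_PID=1234 mom_plan restart" prints "kill -9 1234"
followed by "pbs_mom -p".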

Turns out that this breaks mpiexec jobs if you restart pbs_mom.  I
knew it was too good to be true =(

pbs_mom starts scrolling these errors *very* quickly when it comes back up and
mpiexec is trying to reconnect:

11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;tm_eof, matching task located, marking interface closed
11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in tm_request, non-local connect
11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 13273.hpc-master.usc.edu task 1
11/01/2004 16:55:17;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 13273.hpc-master.usc.edu task 1
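
Since these lines scroll by too quickly to read, a small tally helps to
see which messages dominate. The sketch below assumes the
semicolon-separated layout shown above, with the message text in the
sixth field; tally_mom_errors is a hypothetical helper name, not part of
Torque.

```shell
#!/bin/sh
# Tally identical pbs_mom log messages read from stdin.
# Assumption: fields are separated by ';' and the message is field 6,
# as in the log excerpt above.
tally_mom_errors() {
    awk -F';' '
        NF >= 6 { count[$6]++ }                            # count each distinct message
        END { for (m in count) print count[m] "\t" m }     # emit "count<TAB>message"
    ' | sort -rn                                           # most frequent first
}
```

Usage would be something like "tally_mom_errors < mom_log_file", where
mom_log_file is a placeholder for whichever pbs_mom log file is being
written on the node.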

Any comments from the Torque developers?  Is there any hope for restarting
pbs_mom?


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California