[torqueusers] BUG: MOM segfaults

Garrick Staples garrick at usc.edu
Wed Feb 2 15:26:08 MST 2005


After reading everything below and looking through the code some more, I still
don't think that call to set_globid() is needed.  Maybe it was needed with
OpenPBS 2.3.12, but not with recent TORQUE releases.

In addition, I'm realizing that mpiexec still doesn't work after restarting a
mom.  I think the main reason is that the ji_stdout and ji_stderr port numbers
aren't saved with the job, so the restarted mom can't contact the original
pbs_demux when a new TM_SPAWN request comes in.
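
To make the idea concrete, here is a rough sketch of what saving and restoring
those two port numbers could look like.  It is not taken from the TORQUE
source: the struct job_ports type, the save_ports() and load_ports() helpers,
and the one-line text file format are all made up for illustration.  Only the
field names ji_stdout and ji_stderr come from the discussion above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the part of the job record that matters here.
 * The real job structure in mom has many more fields. */
struct job_ports {
    int ji_stdout;  /* port used to reach pbs_demux for stdout */
    int ji_stderr;  /* port used to reach pbs_demux for stderr */
};

/* Write both port numbers to a small state file (assumed format:
 * two decimal integers on one line).  Returns 0 on success, -1 on error. */
int save_ports(const char *path, const struct job_ports *jp)
{
    FILE *fp;
    int ok;

    assert(path != NULL && jp != NULL);

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    ok = fprintf(fp, "%d %d\n", jp->ji_stdout, jp->ji_stderr) > 0;
    if (fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/* Read the port numbers back, as a restarted mom would need to do before
 * it could service a new spawn request.  Returns 0 on success, -1 if the
 * file is missing or does not hold two integers. */
int load_ports(const char *path, struct job_ports *jp)
{
    FILE *fp;
    int n;

    assert(path != NULL && jp != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    n = fscanf(fp, "%d %d", &jp->ji_stdout, &jp->ji_stderr);
    fclose(fp);

    return n == 2 ? 0 : -1;
}
```

In this sketch the old mom would call save_ports() once the ports are known,
and the restarted mom would call load_ports() while recovering the job.  If
load_ports() fails, the ports are unknown and the restarted mom is in the
situation described above.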

I'm still looking into this stuff, so I may be changing my mind as I sort
everything out.


On Wed, Feb 02, 2005 at 11:35:55AM +1100, Chris Samuel alleged:
> /* CC'd to the mpiexec mailing list for Pete to comment on */
> 
> On Wed, 2 Feb 2005 10:44 am, Garrick Staples wrote:
> 
> > On Tue, Feb 01, 2005 at 10:50:22AM -0700, Marc Aurele La France alleged:
> > > Hi.
> > >
> > > init_abort_job() in src/resmom/catch_child.c contains a call to
> > > set_globid(pj,NULL).  Consequently, it behooves all set_globid()
> >
> > I looked through the code a bit and began to doubt whether that
> > set_globid() call in init_abort_job() was actually required.  The comment
> > says it came from the mpiexec patch.  I commented it out and was able to
> > run new jobs with mpiexec just fine.
> >
> > Anyone from the mpiexec crowd know about this and can comment?
> 
> This has come from the mpiexec patch against OpenPBS 2.3.12 to stop a 
> restarting mom from killing a job launched by mpiexec:
> 
>  mpiexec-0.77/patch/pbs-2.3.12-mom-restart.diff
> 
> The relevant fragment in that patch says:
> 
>  /* set the globid so mom does not coredump in response
>   * to tm_spawn */
>  set_globid(pj, 0);
> 
> This patch and a description is listed in Pete's collection of OpenPBS patches 
> at:
> 
>  http://www.osc.edu/~pw/pbs/
> 
> It says:
> 
> mom-restart.patch - Track running jobs properly across a mom restart. 
> 
> For mpiexec-spawned jobs to survive across a mom restart, and to enable proper 
> accounting for all jobs which continue across a mom restart, this patch fixes 
> some behavior of mom when restarted with the "-p" flag. Note that this patch 
> adds functionality to the machine-specific part of the mom code for linux 
> only. Users of other system types could cut-n-paste that code without too 
> much problem, but as it stands, this patch will break compilation on 
> non-linux systems. 
> 
> This patch does four things: 
> 
> - Fix coredump resulting from tm_spawn to restarted pbs_mom 
> - Avoid race condition by which pbs_mom would sometimes kill itself as tasks 
> exit. 
> - Make a restarted pbs_mom search for and report exiting tasks from jobs which 
> were started before the old mom was killed. 
> - Change response of pbs_mom to various signals. Now the default is to leave 
> all jobs running. If you want to stop all jobs, USR1 can be used to achieve 
> the old behavior.
> 
> 
> cheers!
> Chris
> -- 
>  Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
>  Victorian Partnership for Advanced Computing http://www.vpac.org/
>  Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
> 





-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California