[torqueusers] BUG: MOM segfaults

Chris Samuel csamuel at vpac.org
Tue Feb 1 17:35:55 MST 2005


/* CC'd to the mpiexec mailing list for Pete to comment on */

On Wed, 2 Feb 2005 10:44 am, Garrick Staples wrote:

> On Tue, Feb 01, 2005 at 10:50:22AM -0700, Marc Aurele La France alleged:
> > Hi.
> >
> > init_abort_job() in src/resmom/catch_child.c contains a call to
> > set_globid(pj,NULL).  Consequently, it behooves all set_globid()
>
> I looked through the code a bit and began to doubt whether that
> set_globid() call in init_abort_job() was actually required.  The comment
> says it came from the mpiexec patch.  I commented it out and was able to
> run new jobs with mpiexec just fine.
>
> Anyone from the mpiexec crowd know about this and can comment?

This came from the mpiexec patch against OpenPBS 2.3.12, which stops a
restarting mom from killing a job launched by mpiexec:

 mpiexec-0.77/patch/pbs-2.3.12-mom-restart.diff

The relevant fragment in that patch says:

 /* set the globid so mom does not coredump in response
  * to tm_spawn */
 set_globid(pj, 0);
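
Since that call passes a null second argument, any set_globid() has to cope
with it rather than dereference it. As a rough sketch only: the types and
the demo_set_globid() name below are hypothetical stand-ins of mine, not the
real TORQUE or OpenPBS structures, and the placeholder id of 0 is an
assumption.

```c
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real TORQUE/OpenPBS structures. */
struct demo_startjob_rtn {
    long sj_session;   /* session id reported when the job was started */
};

struct demo_job {
    long globid;       /* illustrative "global id" field */
    int  globid_set;   /* nonzero once an id has been recorded */
};

/*
 * Record a global id for the job.
 * Returns 0 on success, -1 if the job pointer itself is NULL.
 * A NULL sjr is tolerated: the id is set to a placeholder (0, an
 * assumption) instead of being read through the null pointer.
 */
int demo_set_globid(struct demo_job *pj, const struct demo_startjob_rtn *sjr)
{
    if (pj == NULL)
        return -1;

    if (sjr == NULL) {
        /* No start-up record available (e.g. an abort path). */
        pj->globid = 0;
        pj->globid_set = 1;
        return 0;
    }

    pj->globid = sjr->sj_session;
    pj->globid_set = 1;
    return 0;
}
```

With a guard like this, a call of the form demo_set_globid(pj, NULL) returns
0 and marks the id as set without touching the missing record.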

This patch and its description are listed in Pete's collection of OpenPBS
patches at:

 http://www.osc.edu/~pw/pbs/

It says:

mom-restart.patch - Track running jobs properly across a mom restart. 

For mpiexec-spawned jobs to survive across a mom restart, and to enable proper 
accounting for all jobs which continue across a mom restart, this patch fixes 
some behavior of mom when restarted with the "-p" flag. Note that this patch 
adds functionality to the machine-specific part of the mom code for linux 
only. Users of other system types could cut-n-paste that code without too 
much problem, but as it stands, this patch will break compilation on 
non-linux systems. 

This patch does four things: 

- Fix coredump resulting from tm_spawn to restarted pbs_mom 
- Avoid race condition by which pbs_mom would sometimes kill itself as tasks 
exit. 
- Make a restarted pbs_mom search for and report exiting tasks from jobs which 
were started before the old mom was killed. 
- Change response of pbs_mom to various signals. Now the default is to leave 
all jobs running. If you want to stop all jobs, USR1 can be used to achieve 
the old behavior.
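
To illustrate that last item, here is a minimal sketch of the signal policy
it describes: every handled signal asks the daemon to exit, but only USR1
also asks for the jobs to be stopped. This is not the code from the patch.
The demo_* names are hypothetical, and the choice of SIGTERM and SIGINT as
the "leave jobs running" signals is my assumption, since the description
only says "various signals".

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for sigaction() under strict C modes */
#endif

#include <signal.h>
#include <string.h>

/* Flags written by the handler and read by the (hypothetical) main loop. */
static volatile sig_atomic_t demo_exit_requested = 0;
static volatile sig_atomic_t demo_kill_jobs = 0;

/* Any handled signal requests an exit; only SIGUSR1 also stops the jobs. */
static void demo_on_signal(int signo)
{
    demo_exit_requested = 1;
    if (signo == SIGUSR1)
        demo_kill_jobs = 1;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
int demo_install_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = demo_on_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGTERM, &sa, NULL) != 0)   /* assumed: jobs keep running */
        return -1;
    if (sigaction(SIGINT, &sa, NULL) != 0)    /* assumed: jobs keep running */
        return -1;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)   /* old behavior: stop all jobs */
        return -1;
    return 0;
}
```

A main loop would then check demo_exit_requested on each pass and consult
demo_kill_jobs before deciding whether to signal the running jobs on the way
out.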


cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
