[torqueusers] Re: Epilogue script

Cliff Kirby ckirby3 at colsa.com
Tue Aug 29 09:26:37 MDT 2006


This is one approach I am testing now.  The C program checks the
/proc/<pid>/environ file on the mother superior for the MPI processes (I
think), but that only applies to the Linux and AIX machines in our lab.
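
For anyone who wants to see the idea without reading the C source, here is
a rough Python sketch of the same kind of check.  It is not the C program
itself.  It assumes a Linux-style /proc where /proc/<pid>/environ holds
NUL-separated KEY=VALUE entries, and it assumes the job ID is carried in the
PBS_JOBID variable.  The function names parse_environ and pids_for_job are
made up for illustration.  It only reports matching PIDs; it does not kill
anything.

```python
import os


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE block read from /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def pids_for_job(jobid, uid, proc_root="/proc"):
    """Return PIDs owned by uid whose environment has PBS_JOBID == jobid.

    Assumption: the job ID is exported as PBS_JOBID, and proc_root has the
    Linux layout (one numeric directory per process).
    """
    matches = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self"
        pid_dir = os.path.join(proc_root, name)
        try:
            if os.stat(pid_dir).st_uid != uid:
                continue  # only consider the job owner's processes
            with open(os.path.join(pid_dir, "environ"), "rb") as fh:
                env = parse_environ(fh.read())
        except OSError:
            # The process exited in the meantime, or we may not read it.
            continue
        if env.get("PBS_JOBID") == jobid:
            matches.append(int(name))
    return sorted(matches)
```

An epilogue could call something like pids_for_job(jobid, uid) on each node
and then signal the PIDs it returns.  Filtering on both the owner and the
job ID is what keeps other jobs on a shared node untouched.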

----------------------------------------------------------------------------

Maybe an alternative: the epilogue script at
http://bellatrix.pcl.ox.ac.uk/~ben/pbs/ :

"When running parallel jobs on Linux clusters with MPICH and PBS, "slave"
MPICH processes are often left behind on one or more nodes at job abortion.
PBS makes no attempt to clean up processes on any node except the master
node, and so these processes can linger for some time. The approach used at
the WGR/PDL lab is to kill these processes by means of a second MPI-enabled
program, which is run on the same set of nodes that the main job was run on,
by the PBS epilogue facility. This program kills all of the user's processes
that have the relevant PBS job ID in their environment, so should leave
other jobs on the same machine untouched. To set up this system, this C
program should be compiled with mpicc and installed as
/usr/local/bin/mpicleanup on every MPI node. This epilogue script should
then be used by PBS on every node (usually it needs to be installed as
/usr/spool/PBS/mom_priv/epilogue) to call the MPICH cleanup program properly
at job termination."

----------------------------------------------------------------------------

I have also played around with
http://svn.osc.edu/repos/pbstools/trunk/sbin/reaver but on the bigger jobs
it fails at the qstat call.  Maybe too many requests at once?

I will add my vote to see this implemented on all the MOM architectures,
including OSX.


-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Monday, August 28, 2006 7:43 PM
To: Dave Jackson
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] Re: Epilogue script


Is there a cross-platform way to search the env of processes?  It seems like
this will have to be implemented separately for each MOM arch.

On Mon, Aug 28, 2006 at 05:07:11PM -0600, Dave Jackson alleged:
> Glen,
> 
>   I believe there is the possibility of negative side effects but the 
> likelihood of this is immensely small.  A user would need to 
> inadvertently set a specific environment variable to a specific value 
> to have an issue.  This does not happen in the real world and if it 
> does, this feature is configurable and is off by default.
> 
>   I also believe there are exceptional cases in which it would not 
> work. But these are not the majority.  I think we have a capability 
> which would easily and immediately benefit many sites.  While this 
> capability does not cover 100% of cases, it definitely makes things 
> better for most.  Weighing pros and cons, I think this feature is 
> clearly worth it.
> 
> Dave
> 
> On Mon, 2006-08-28 at 18:49 -0400, Glen Beane wrote:
> > I think I agree with Garrick on this one.
> > 
> > On 8/28/06, Garrick Staples <garrick at clusterresources.com> wrote:
> > > I'm really uncomfortable with pbs_mom killing off processes that 
> > > aren't under its control.  Even though looking for a jobid env var 
> > > seems like a reasonable assumption, I'm sure it will break someone 
> > > somewhere.
> > >
> > > This sounds like a site-specific assumption that is easily, and 
> > > sanely, handled in epilogue.
> > >
> > > Perhaps this just belongs in the Wiki.
> > >
> > >
> > > On Mon, Aug 28, 2006 at 11:43:15AM -0400, Andrew Keen alleged:
> > > > Dave,
> > > >
> > > > This feature would be very useful to us as we often have this 
> > > > problem (although not as often since we've migrated to using 
> > > > OSU's mpiexec instead of mpirun).
> > > >
> > > > -Andy
> > > >
> > > > torqueusers-request at supercluster.org wrote:
> > > > >
> > > > >   1. Re: Epilogue script (Dave Jackson)
> > > > >   2. Re: Epilogue script (Diego M. Vadell)
> > > > >
> > > > >
> > > > >---------------------------------------------------------------
> > > > >-------
> > > > >
> > > > >Message: 1
> > > > >Date: Fri, 25 Aug 2006 13:13:49 -0600
> > > > >From: Dave Jackson <jacksond at clusterresources.com>
> > > > >Subject: Re: [torqueusers] Epilogue script
> > > > >To: "Diego M. Vadell" <dvadell at linuxclusters.com.ar>
> > > > >Cc: torquedev at supercluster.org, torqueusers at supercluster.org
> > > > >Message-ID: <1156533229.10669.77.camel at koa.icluster.org>
> > > > >Content-Type: text/plain
> > > > >
> > > > >Diego,
> > > > >
> > > > >  What would be the negatives of enabling this feature in a 
> > > > > >much more integrated manner?  i.e., both mother superior and 
> > > > > >sister moms have a config option 'cleanup_procs = true' which 
> > > > > >if true will search the process tree for processes owned by 
> > > > >user X with a matching job id in the environment.  pbs_mom 
> > > > >could then terminate all of these processes directly.  This 
> > > > >would make this feature much easier for most sites to activate.  
> > > > >No epilog/prolog creation, no compiling, simply set a 
> > > > >parameter.  And as you mention, it would work in both dedicated 
> > > > >and shared node operation.
> > > > >
> > > > >  Thoughts?
> > > > >
> > > > >Dave
> > > > >
> > > >
> > > > _______________________________________________
> > > > torqueusers mailing list
> > > > torqueusers at supercluster.org 
> > > > http://www.supercluster.org/mailman/listinfo/torqueusers