[torqueusers] Epilogue script

pat.o'bryant at exxonmobil.com
Mon Aug 28 06:35:12 MDT 2006


Diego,
     With our initial install of Torque and Maui, we had problems with
parallel jobs terminating and leaving behind shared memory segments and
semaphores. We coded an epilogue script to remove these artifacts. I think
that something within Torque that performed this same kind of cleanup would
be very useful.
        Thanks,
         Pat



J.W. (Pat) O'Bryant,Jr.
Business Line Infrastructure
Technical Systems, HPC
Office: 713-431-7022
Pager: 713-606-8338
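
As a rough illustration of the kind of epilogue Pat describes (not his
actual script), something along these lines removes the SysV shared memory
segments and semaphores left behind by a job's owner. It follows TORQUE's
epilogue argument convention, where $1 is the job id and $2 is the job
owner's user name:

    #!/bin/sh
    # Sketch: remove SysV IPC objects left behind by a finished job.
    # TORQUE's epilogue is called with the job id as $1 and the job
    # owner's user name as $2, and normally runs as root.
    jobuser="$2"
    [ -n "$jobuser" ] || exit 0

    # Shared memory segments owned by the job's user (shmid is column 2,
    # owner is column 3 in "ipcs -m" output).
    ipcs -m | awk -v u="$jobuser" '$3 == u { print $2 }' |
    while read id; do
        ipcrm -m "$id"
    done

    # Semaphore arrays owned by the job's user.
    ipcs -s | awk -v u="$jobuser" '$3 == u { print $2 }' |
    while read id; do
        ipcrm -s "$id"
    done

    exit 0

Note that anything keyed purely on the user name will also remove IPC
objects belonging to the same user's other jobs on a shared node, which is
the concern Eugene raises further down the thread.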


                                                                           
             "Diego M.                                                     
             Vadell"                                                       
             <dvadell at linux                                             To 
             clusters.com.a           Dave Jackson                         
             r>                       <jacksond at clusterresources.com>      
             Sent by:                                                   cc 
             torqueusers-bo           torquedev at supercluster.org,          
             unces at superclu           torqueusers at supercluster.org         
             ster.org                                              Subject 
                                      Re: [torqueusers] Epilogue script    
                                                                           
             08/25/06 05:36                                                
             PM                                                            
                                                                           
                                                                           
                                                                           




Hi Dave,
   At first sight it looks like a good solution to a problem I have had,
and still have, with many installations, and I don't see any negatives.
I haven't tried it yet (I'll do it tomorrow), so I don't feel confident
making any statement about it.

    I very much like your idea of integrating it into Torque. The problem
it tries to solve is pretty common, at least in my limited experience with
clusters. But I'm not an expert in Torque or batch systems, so maybe
somebody else can comment?

Greetings,
 -- Diego.

On Friday 25 August 2006 16:13, Dave Jackson wrote:
> Diego,
>
>   What would be the negatives of enabling this feature in a much more
> integrated manner?  i.e., both mother superior and sister moms have a
> config option 'cleanup_procs = true' which, if true, will search the
> process tree for processes owned by user X with a matching job id in
> the environment.  pbs_mom could then terminate all of these processes
> directly.  This would make this feature much easier for most sites to
> activate.  No epilogue/prologue creation, no compiling, simply set a
> parameter.  And as you mention, it would work in both dedicated and
> shared node operation.
>
>   Thoughts?
>
> Dave
>
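
For concreteness: 'cleanup_procs' above is a proposed parameter, not an
existing TORQUE option. The check Dave describes (owned by the job's user
and carrying the job id in the environment) can be sketched from a shell as
follows; this is an illustration only, not how pbs_mom would implement it:

    #!/bin/sh
    # Illustration of the per-node check: kill every process that is
    # owned by the job's user AND has the job id in its environment.
    # $1 = job id, $2 = user name (TORQUE epilogue arguments); run as root.
    jobid="$1"
    jobuser="$2"

    for pidpath in /proc/[0-9]*; do
        pid=${pidpath#/proc/}
        owner=$(stat -c %U "$pidpath" 2>/dev/null) || continue
        [ "$owner" = "$jobuser" ] || continue
        # /proc/<pid>/environ is NUL-separated; look for PBS_JOBID=<jobid>.
        if tr '\0' '\n' < "$pidpath/environ" 2>/dev/null |
           grep -q "^PBS_JOBID=$jobid"; then
            kill -TERM "$pid" 2>/dev/null
        fi
    done

Reading another user's /proc/<pid>/environ needs the privileges the
epilogue normally runs with, which is one reason doing this inside pbs_mom
itself, as Dave suggests, is attractive.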
> On Thu, 2006-08-24 at 00:04 -0300, Diego M. Vadell wrote:
> > Maybe an alternative: the epilogue script in
> > http://bellatrix.pcl.ox.ac.uk/~ben/pbs/ :
> >
> > " When running parallel jobs on Linux clusters with MPICH and PBS,
> > "slave" MPICH processes are often left behind on one or nodes at job
> > abortion. PBS makes no attempt to clean up processes on any node except
> > the master node, and so these processes can linger for some time. The
> > approach used at the WGR/PDL lab is to kill these processes by means of
a
> > second MPI-enabled program, which is run on the same set of nodes that
> > the main job was run on, by the PBS epilogue facility. This program
kills
> > all of the user's processes that have the relevant PBS job ID in their
> > environment, so should leave other jobs on the same machine untouched.
To
> > set up this system, this C program should be compiled with mpicc and
> > installed as /usr/local/bin/mpicleanup on every MPI node. This epilogue
> > script should then be used by PBS on every node (usually it needs to be
> > installed as /usr/spool/PBS/mom_priv/epilogue) to call the MPICH
cleanup
> > program properly at job termination."
> >
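
The page linked above carries the actual mpicleanup source and epilogue.
Purely as an illustration of the same idea without MPI, an epilogue could
also walk the job's node list itself, assuming a matching prologue has
saved that list somewhere the epilogue can read it. Both the node-list path
and the per-node command name below are hypothetical:

    #!/bin/sh
    # Hypothetical epilogue: run a per-node cleanup command on every node
    # the job used.  Assumes a prologue saved the job's node list to the
    # file named below; both that path and /usr/local/bin/jobcleanup are
    # stand-ins, not real TORQUE or MPICH components.
    jobid="$1"
    jobuser="$2"
    nodefile="/var/spool/torque/aux_nodelists/$jobid"   # assumed location

    [ -r "$nodefile" ] || exit 0
    sort -u "$nodefile" | while read node; do
        # e.g. the environment-scan script sketched earlier in the thread
        ssh "$node" /usr/local/bin/jobcleanup "$jobid" "$jobuser" < /dev/null
    done
    rm -f "$nodefile"
    exit 0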
> > I haven't tried it yet.
> >
> > Hope it helps,
> >  -- Diego.
> >
> > On Tuesday 22 August 2006 11:42, Cliff Kirby wrote:
> > > I currently use an epilogue script to kill all the PIDs of the user,
> > > but that is not the best solution.  Tracking down the child processes
> > > of an mpirun parallel job is not an easy task, because each cluster
> > > system participating in the parallel job creates unique PIDs for the
> > > job.
> > > I hope your question gets answered, because I want the same thing you
> > > do.
> > >
> > > - Cliff
> > >
> > > -----Original Message-----
> > > From: torqueusers-bounces at supercluster.org
> > > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Eugene van
> > > den Hurk
> > > Sent: Tuesday, August 22, 2006 4:37 AM
> > > To: torqueusers at supercluster.org
> > > Subject: [torqueusers] Epilogue script
> > >
> > >
> > > Hello,
> > >
> > > I am looking at implementing torque on our cluster.
> > > I have been looking at using an epilogue script to clean up after
> > > jobs, particularly if the job is aborted or deleted.
> > > This seems to be particularly needed in the case when running jobs
> > > using mpich and mpirun.
> > > I have looked at using mpiexec instead of mpirun. I installed mpiexec
> > > and it seems to work fine.
> > > Can anyone think of any reason why using mpiexec instead of mpirun is
> > > a bad idea?
> > > If I use mpiexec instead of mpirun, would I be right in thinking that
> > > it is still a good idea to use epilogue scripts for other types of
> > > jobs?
> > > Each node is dual processor so I do not want to kill processes based
> > > on username, as a user may have more than one job on a node.
> > > So it looks like I would have to use a script that would be able to
> > > kill orphaned processes based on job id.
> > > Would anyone have any suggestions as to how I could do this or sample
> > > scripts that I could try?
> > > Any help would be greatly appreciated.
> > >
> > > Thanks,
> > > Regards,
> > > Eugene.
> > >
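
One per-job (rather than per-user) variant of what Eugene asks for: TORQUE's
epilogue also receives the job's session id, as its fifth argument, so on
the node where the job ran the leftover processes can be matched by session
instead of by user name. A minimal, untested sketch:

    #!/bin/sh
    # Minimal sketch: kill whatever is left of the job's process session
    # rather than everything owned by the user.  In TORQUE's epilogue,
    # $5 is the job's session id.
    sid="$5"
    [ -n "$sid" ] && [ "$sid" -gt 1 ] 2>/dev/null || exit 0

    pkill -TERM -s "$sid"    # ask politely first
    sleep 5
    pkill -KILL -s "$sid"    # then force anything that ignored SIGTERM

This only catches processes that still share the job's session on that
node; MPICH slaves started via rsh/ssh on other nodes usually will not,
which is why the job-id-in-environment approach discussed above is still
needed for them.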
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers



