[torqueusers] Epilogue script

Dave Jackson jacksond at clusterresources.com
Fri Aug 25 13:13:49 MDT 2006


Diego,

  What would be the negatives of enabling this feature in a much more
integrated manner?  i.e., both mother superior and sister moms have a
config option 'cleanup_procs = true' which, if true, will search the
process tree for processes owned by user X with a matching job id in
the environment.  pbs_mom could then terminate all of these processes
directly.  This would make the feature much easier for most sites to
activate: no epilogue/prologue creation, no compiling, simply set a
parameter.  And as you mention, it would work in both dedicated and
shared node operation.
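
  To make the idea concrete, here is a rough sketch of the search step
(a hypothetical illustration only -- no 'cleanup_procs' option exists in
pbs_mom today, and the function name and arguments are made up for the
example; written in Python for readability, though pbs_mom itself would
do this in C):

```python
import os

def find_job_pids(proc_root, owner_uid, jobid):
    """Return PIDs under proc_root (normally /proc) whose owner matches
    owner_uid and whose environment contains PBS_JOBID=jobid."""
    needle = ("PBS_JOBID=" + jobid).encode()
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        path = os.path.join(proc_root, entry)
        try:
            if os.stat(path).st_uid != owner_uid:
                continue  # belongs to a different user
            # /proc/<pid>/environ holds NUL-separated VAR=value pairs
            with open(os.path.join(path, "environ"), "rb") as f:
                env_vars = f.read().split(b"\0")
        except OSError:
            continue  # process exited or environ is unreadable
        if needle in env_vars:
            pids.append(int(entry))
    return pids
```

  pbs_mom would then send SIGTERM (and, after a grace period, SIGKILL)
to each returned PID.  Matching on both the owning uid and the PBS_JOBID
in the environment is what keeps other jobs by the same user untouched
on shared nodes.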

  Thoughts?

Dave

On Thu, 2006-08-24 at 00:04 -0300, Diego M. Vadell wrote:
> Maybe an alternative: the epilogue script in 
> http://bellatrix.pcl.ox.ac.uk/~ben/pbs/ :
> 
> " When running parallel jobs on Linux clusters with MPICH and PBS, "slave" 
> MPICH processes are often left behind on one or more nodes at job abortion. PBS 
> makes no attempt to clean up processes on any node except the master node, 
> and so these processes can linger for some time. The approach used at the 
> WGR/PDL lab is to kill these processes by means of a second MPI-enabled 
> program, which is run on the same set of nodes that the main job was run on, 
> by the PBS epilogue facility. This program kills all of the user's processes 
> that have the relevant PBS job ID in their environment, so should leave other 
> jobs on the same machine untouched. To set up this system, this C program 
> should be compiled with mpicc and installed as /usr/local/bin/mpicleanup on 
> every MPI node. This epilogue script should then be used by PBS on every node 
> (usually it needs to be installed as /usr/spool/PBS/mom_priv/epilogue) to 
> call the MPICH cleanup program properly at job termination."
> 
> I haven't tried it yet.
> 
> Hope it helps,
>  -- Diego.
> 
> On Tuesday 22 August 2006 11:42, Cliff Kirby wrote:
> > I currently use an epilogue script to kill all the PIDs of the user but
> > that is not the best solution.  Tracking down the child processes of an
> > mpirun parallel job is not an easy task because each cluster system
> > participating in the parallel job creates unique PIDs for the job.
> > I hope your question is answered because I want the same thing you do.
> >
> > - Cliff
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Eugene van den
> > Hurk
> > Sent: Tuesday, August 22, 2006 4:37 AM
> > To: torqueusers at supercluster.org
> > Subject: [torqueusers] Epilogue script
> >
> >
> > Hello,
> >
> > I am looking at implementing torque on our cluster.
> > I have been looking at using an epilogue script to clean up after
> > jobs, particularly if the job is aborted or deleted.
> > This seems to be particularly needed in the case when running jobs
> > using mpich and mpirun.
> > I have looked at using mpiexec instead of mpirun. I installed mpiexec
> > and it seems to work fine.
> > Can anyone think of any reason why using mpiexec instead of mpirun is
> > a bad idea?
> > If I use mpiexec instead of mpirun, would I be right in thinking that
> > it is still a good idea to use epilogue scripts for other types of
> > jobs?
> > Each node is dual processor so I do not want to kill processes based
> > on username, as a user may have more than one job on a node.
> > So it looks like I would have to use a script that would be able to
> > kill orphaned processes based on job id.
> > Would anyone have any suggestions as to how I could do this or sample
> > scripts that I could try?
> > Any help would be greatly appreciated.
> >
> > Thanks,
> > Regards,
> > Eugene.
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
