[torquedev] Re: [torqueusers] Epilogue script
Dave Jackson
jacksond at clusterresources.com
Fri Aug 25 13:13:49 MDT 2006
Diego,
What would be the negatives of enabling this feature in a much more
integrated manner? I.e., both mother superior and sister moms have a
config option 'cleanup_procs = true' which, if true, will search the
process tree for processes owned by user X with a matching job id in
the environment. pbs_mom could then terminate all of these processes
directly. This would make this feature much easier for most sites to
activate. No epilog/prolog creation, no compiling, simply set a
parameter. And as you mention, it would work in both dedicated and
shared node operation.
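
To make the idea concrete, here is a rough sketch (in Python rather than
inside pbs_mom, purely for illustration) of the scan described above. It
assumes a Linux /proc filesystem and that job processes carry the job id in
the PBS_JOBID environment variable. The function names and the proc_root
parameter are made up for this example and are not part of TORQUE.

```python
import os
import signal
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Split a NUL-separated /proc/<pid>/environ blob into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:  # skip empty or malformed entries
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def find_job_pids(job_id: str, uid: int, proc_root: str = "/proc") -> list:
    """Return PIDs owned by uid whose environment has PBS_JOBID == job_id.

    proc_root is a hypothetical knob so the scan can be pointed at a
    test directory instead of the real /proc.
    """
    matches = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            if entry.stat().st_uid != uid:
                continue  # owned by a different user
            env = parse_environ((entry / "environ").read_bytes())
        except OSError:
            continue  # process exited, or environ is not readable
        if env.get("PBS_JOBID") == job_id:
            matches.append(int(entry.name))
    return sorted(matches)


def terminate(pids, sig=signal.SIGTERM) -> None:
    """Send sig to each PID, ignoring processes that have already gone."""
    for pid in pids:
        if pid == os.getpid():
            continue  # never signal ourselves
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

For example, find_job_pids("1234.headnode", 1001) would list the candidate
PIDs for uid 1001 (both values are placeholders), and that list could then
be handed to terminate(). Reading another user's environ generally requires
root, which pbs_mom normally has. Because the match is on job id and not
just on user name, a second job from the same user on the node is left alone.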
Thoughts?
Dave
On Thu, 2006-08-24 at 00:04 -0300, Diego M. Vadell wrote:
> Maybe an alternative: the epilogue script in
> http://bellatrix.pcl.ox.ac.uk/~ben/pbs/ :
>
> " When running parallel jobs on Linux clusters with MPICH and PBS, "slave"
> MPICH processes are often left behind on one or more nodes at job abortion. PBS
> makes no attempt to clean up processes on any node except the master node,
> and so these processes can linger for some time. The approach used at the
> WGR/PDL lab is to kill these processes by means of a second MPI-enabled
> program, which is run on the same set of nodes that the main job was run on,
> by the PBS epilogue facility. This program kills all of the user's processes
> that have the relevant PBS job ID in their environment, so should leave other
> jobs on the same machine untouched. To set up this system, this C program
> should be compiled with mpicc and installed as /usr/local/bin/mpicleanup on
> every MPI node. This epilogue script should then be used by PBS on every node
> (usually it needs to be installed as /usr/spool/PBS/mom_priv/epilogue) to
> call the MPICH cleanup program properly at job termination."
>
> I haven't tried it yet.
>
> Hope it helps,
> -- Diego.
>
> On Tuesday 22 August 2006 11:42, Cliff Kirby wrote:
> > I currently use an epilogue script to kill all the PIDs of the user but
> > that is not the best solution. Tracking down the child processes of an
> > mpirun parallel job is not an easy task because each cluster system
> > participating in the parallel job creates unique PIDs for the job.
> > I hope your question is answered because I want the same thing you do.
> >
> > - Cliff
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Eugene van den
> > Hurk
> > Sent: Tuesday, August 22, 2006 4:37 AM
> > To: torqueusers at supercluster.org
> > Subject: [torqueusers] Epilogue script
> >
> >
> > Hello,
> >
> > I am looking at implementing torque on our cluster.
> > I have been looking at using an epilogue script to clean up after
> > jobs, particularly if the job is aborted or deleted.
> > This seems to be particularly needed in the case when running jobs
> > using mpich and mpirun.
> > I have looked at using mpiexec instead of mpirun. I installed mpiexec
> > and it seems to work fine.
> > Can anyone think of any reason why using mpiexec instead of mpirun is
> > a bad idea?
> > If I use mpiexec instead of mpirun would I be right in thinking that
> > it is still a good idea to use epilogue
> > scripts for other types of jobs?
> > Each node is dual processor so I do not want to kill processes based
> > on username, as a user may have more than one job on a node.
> > So it looks like I would have to use a script that would be able to
> > kill orphaned processes based on job id.
> > Would anyone have any suggestions as to how I could do this or sample
> > scripts that I could try?
> > Any help would be greatly appreciated.
> >
> > Thanks,
> > Regards,
> > Eugene.
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers