[torqueusers] Epilogue script
troy at osc.edu
Tue Aug 22 08:40:42 MDT 2006
On Tue, 2006-08-22 at 10:37 +0100, Eugene van den Hurk wrote:
> I am looking at implementing torque on our cluster.
> I have been looking at using an epilogue script to clean up after
> jobs, particularly if the job is aborted or deleted.
> This seems to be particularly needed in the case when running jobs
> using mpich and mpirun.
> I have looked at using mpiexec instead of mpirun. I installed mpiexec
> and it seems to work fine.
> Can anyone think of any reason why using mpiexec instead of mpirun is
> a bad idea?
None that I can think of, but then again I'm biased. :)
The things we've run into at OSC that sometimes can make mpiexec
problematic are ISV codes that are compiled statically against MPICH/p4
and insist on invoking mpirun under the covers. The parallel version of
Turbomole is the one with which I remember having the most trouble, but
I think the parallel version of Abaqus works like this as well. Pete W.
developed an mpirun "replacement" that took all the well-known MPICH
mpirun arguments, ignored most of them, and invoked mpiexec. However, I
don't know if he's made that part of the mpiexec distribution or not.
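
For illustration only, the idea behind such a wrapper can be sketched in
Python roughly as follows. This is not Pete's script: the translate() name
and the two option tables are made up here, the tables cover only a small
subset of MPICH mpirun flags, and the sketch assumes mpiexec accepts -n for
the process count.

```python
# Illustrative subset of MPICH mpirun options (an assumption, not a full list).
FLAGS_WITH_VALUE = {"-np", "-machinefile", "-arch", "-machine"}
FLAGS_NO_VALUE = {"-nolocal", "-v", "-t"}


def translate(argv):
    """Turn an mpirun-style argument list into an mpiexec command line.

    Only the process count is carried over; every other recognised
    mpirun option is silently dropped, since mpiexec gets its node
    list from the batch system rather than from a machine file.
    """
    nprocs = None
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in FLAGS_WITH_VALUE:
            if i + 1 >= len(argv):
                raise ValueError("option %s needs a value" % arg)
            if arg == "-np":
                nprocs = argv[i + 1]
            i += 2  # skip the option and its value
        elif arg in FLAGS_NO_VALUE:
            i += 1  # skip the bare option
        else:
            break  # first unrecognised word is taken to be the program
    cmd = ["mpiexec"]
    if nprocs is not None:
        cmd += ["-n", nprocs]
    return cmd + argv[i:]
```

A real wrapper installed as "mpirun" would finish by handing the result to
os.execvp(cmd[0], cmd), with cmd = translate(sys.argv[1:]).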
> If I use mpiexec instead of mpirun, would I be right in thinking that
> it is still a good idea to use epilogue
> scripts for other types of jobs?
> Each node is dual processor so I do not want to kill processes based
> on username, as a user may have more than one job on a node.
> So it looks like I would have to use a script that would be able to
> kill orphaned processes based on job id.
> Would anyone have any suggestions as to how I could do this or sample
> scripts that I could try?
Take a look at the reaver script in the pbstools SVN head:
It's designed to identify (and optionally kill) processes which are
*not* owned by either system userids or userids with jobs assigned to
the node, but you can give it a list of jobids to clean up as well.
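
As a rough sketch of the two selection rules (not the reaver code itself),
something like the Python below would do. The function names, the uid
cutoff of 999 for "system" accounts, and the use of the PBS_JOBID variable
from each process's environment to match jobids are all assumptions made
for illustration; the inputs are passed in as plain data so that nothing
here touches /proc or sends any signals.

```python
def find_strays(processes, job_owners, system_uid_max=999):
    """Return PIDs owned by neither a system account nor a job owner.

    processes  -- iterable of (pid, uid, username) tuples
    job_owners -- set of usernames with jobs assigned to this node
    system_uid_max -- assumed upper bound for system userids
    """
    return [pid for pid, uid, user in processes
            if uid > system_uid_max and user not in job_owners]


def parse_environ(raw):
    """Parse the NUL-separated KEY=VALUE bytes of /proc/<pid>/environ."""
    env = {}
    for item in raw.split(b"\0"):
        key, sep, value = item.partition(b"=")
        if sep:  # ignore empty or malformed entries
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def pids_for_jobs(environs, jobids):
    """Return sorted PIDs whose PBS_JOBID is one of the given jobids.

    environs -- dict mapping pid to the raw bytes of its environ file
    """
    wanted = set(jobids)
    return sorted(pid for pid, raw in environs.items()
                  if parse_environ(raw).get("PBS_JOBID") in wanted)
```

Matching on jobid rather than username is what keeps a second job belonging
to the same user on a dual-processor node from being caught by the cleanup.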
I haven't done a release of pbstools in a while... [makes mental note
to do one soon]
Troy Baer troy at osc.edu
Science & Technology Support http://www.osc.edu/hpc/
Ohio Supercomputer Center 614-292-9701