[torqueusers] Jobs stuck in Queue
Joshua Bernstein
jbernstein at penguincomputing.com
Thu Oct 4 13:51:07 MDT 2007
Hello All,
I'm having a problem running MPI-based jobs linked against MPICH
under TORQUE.
The problem is this: in my job script, I try to start an MPI job in the
same way I would outside TORQUE:
---
#PBS -j oe
<code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
exec ./mpijob
---
This of course correctly starts the job on the nodes, but if I do a
qdel to kill the job, the job leaves the TORQUE queue while its
processes stay on the nodes. This behavior has led me to use mpiexec.
So, if I use mpiexec a la:
---
#PBS -j oe
<code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
mpiexec -comm none ./mpijob
---
The job again starts properly on the nodes (albeit a bit slower), and
when I do a qdel, the processes get properly cleaned off the nodes.
The trouble here is that the job still shows up in the TORQUE queue
marked as running. The only way to clean up this job is to remove its
entries from $PBS_HOME/server_priv/jobs and from $PBS_HOME/mom_priv/jobs.
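For illustration only, the placeholder step in the scripts above could be
sketched roughly as follows. This is not the code from the original job
script. It assumes that BEOWULF_JOB_MAP is a colon-separated list of node
numbers (one entry per MPI rank) and that the hostnames in PBS_NODEFILE
contain the node number, e.g. n0, n1. The function name
nodefile_to_job_map is hypothetical.

```shell
# Hypothetical helper: turn a TORQUE node file into a colon-separated map.
# ASSUMPTION: every line contains a node number, as in "n0" or "n12";
# lines without any digits are passed through unchanged.
nodefile_to_job_map() {
    # Keep only the first run of digits on each line, then join with ":".
    sed 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/' "$1" | paste -s -d: -
}

# Only set the map when TORQUE has provided a readable node file.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    BEOWULF_JOB_MAP=$(nodefile_to_job_map "$PBS_NODEFILE")
    export BEOWULF_JOB_MAP
fi
```

Under these assumptions, a node file listing n0 twice and n1 twice would
yield the map 0:0:1:1. If the cluster uses a different naming scheme, the
sed expression would need to change.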
Any ideas to help point me in the right direction?
-Joshua Bernstein
Software Engineer
Penguin Computing