[torqueusers] Jobs stuck in Queue

Joshua Bernstein jbernstein at penguincomputing.com
Thu Oct 4 13:51:07 MDT 2007


Hello All,

I'm having a problem running MPI-based jobs linked against MPICH 
under TORQUE.

The problem is this: in my job script, I try to start an MPI job in the 
same way I would outside TORQUE:

---
#PBS -j oe
<code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
exec ./mpijob
---

This of course correctly starts the job on the nodes, but if I do a 
qdel to kill the job, the job leaves the TORQUE queue while the 
processes stay on the nodes. This behavior has led me to use mpiexec.
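
For illustration, here is a minimal sketch of a signal-forwarding variant of 
the script above. It rests on two assumptions that are not verified here: 
that qdel delivers SIGTERM to the job script's shell before following up 
with SIGKILL, and that ./mpijob tears down its remote processes when it 
receives SIGTERM. The function name run_with_cleanup is hypothetical. 
Note that exec is dropped, since exec replaces the shell and leaves 
nothing behind to catch the signal.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and forward
# SIGTERM/SIGINT to it, so the shell stays alive to catch signals.
run_with_cleanup() {
    child=""
    # Install the handler before launching so an early signal is not missed.
    trap 'kill -TERM "$child" 2>/dev/null; wait "$child" 2>/dev/null; exit 143' TERM INT
    "$@" &
    child=$!
    # wait returns early if a trapped signal arrives; the handler then runs.
    wait "$child"
    status=$?
    trap - TERM INT
    return "$status"
}
```

In the job script, the last line would then read 
"run_with_cleanup ./mpijob" instead of "exec ./mpijob". Whether this 
actually clears the processes off the nodes depends on how the MPICH 
launcher reacts to SIGTERM.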

So, if I use mpiexec a la:

---
#PBS -j oe
<code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
mpiexec -comm none ./mpijob
---

The job again starts properly on the nodes (albeit a bit slower), and 
when I do a qdel, the processes get properly cleaned off the nodes. 
The trouble here is that the job still shows up in the TORQUE queue 
marked as running. The only way to clean up this job is to remove its 
entries from $PBS_HOME/server_priv/jobs and from $PBS_HOME/mom_priv/jobs.

Any ideas to help point me in the right direction?

-Joshua Bernstein
Software Engineer
Penguin Computing

