[torqueusers] Jobs stuck in Queue

Garrick Staples garrick at usc.edu
Thu Oct 4 17:02:11 MDT 2007


On Thu, Oct 04, 2007 at 12:51:07PM -0700, Joshua Bernstein alleged:
> Hello All,
> 
> I'm having a problem running MPI-based jobs linked against MPICH
> under TORQUE.
> 
> The problem is this: in my job script, I try to start an MPI job in the
> same way I would outside TORQUE:
> 
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> exec ./mpijob
> ---

exec?  Why replace the top-level shell process?
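
To make the question concrete, here is a minimal sketch of what exec changes.
It is an illustration only: 'true' stands in for ./mpijob, and the function
names plain_call and exec_call are made up for this example.

```shell
#!/bin/sh
# Illustration only: 'true' stands in for ./mpijob, and the echo stands in
# for anything the job script would do after the program exits.

# Plain call: the shell waits for the command, then carries on.
plain_call() {
    sh -c 'true; echo "shell still here after plain call"'
}

# exec: the shell process is replaced by the command, so the echo
# that follows it is never reached.
exec_call() {
    sh -c 'exec true; echo "shell still here after exec"'
}

plain_call
exec_call
```

Only the plain call prints its message. With exec there is no top-level shell
left once the program starts.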

 
> This of course correctly starts the job on the nodes, but if I do a
> qdel to kill the job, the job leaves the TORQUE queue while the
> processes stay on the nodes. This behavior has led me to use mpiexec.

At least the processes on the MS node are killed, right?

 
> So, if I use mpiexec a la:
> 
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> mpiexec -comm none ./mpijob
> ---

-comm none?  That's only for non-MPI programs.

 
> The jobs again start properly on the nodes (albeit a bit slower), and
> when I do a qdel, the processes get properly cleaned off the nodes.
> The trouble here is that the job still shows up in the TORQUE queue
> marked as running. The only way to clean up this job is to remove its
> entries from $PBS_HOME/server_priv/jobs and from $PBS_HOME/mom_priv/jobs.

First, manually deleting files is bad.  If you really must purge a job, use
'momctl -c' to clear it from the node and 'qdel -p' to purge it from the
server.  That said, never use those commands!
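
If it ever does come to that, the two commands could be wrapped roughly as
follows. This is a sketch under assumptions: purge_stale_job and REALLY_PURGE
are hypothetical names, the job id and node name are made up, and the helper
only prints the commands unless explicitly told to run them.

```shell
#!/bin/sh
# purge_stale_job is a hypothetical helper, not part of TORQUE.
# By default it only prints the commands; set REALLY_PURGE=1 to run them.
purge_stale_job() {
    jobid=$1    # e.g. 1234.headnode (made-up job id)
    node=$2     # compute node whose pbs_mom still holds the job
    if [ "${REALLY_PURGE:-0}" = 1 ]; then
        run=
    else
        run=echo
    fi
    # Clear the stale job from the MOM on that node.
    $run momctl -h "$node" -c "$jobid"
    # Purge the job from pbs_server.
    $run qdel -p "$jobid"
}
```

Calling 'purge_stale_job 1234.headnode node01' without REALLY_PURGE set just
echoes the momctl and qdel lines, which gives you a chance to check the job
id and node before anything is removed.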

If you look in pbs_mom's log file, you'll probably find an error message
related to not being able to talk to the server.
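
One way to pull the relevant lines out of the log on the compute node is
sketched below. It assumes pbs_mom writes one file per day under
$PBS_HOME/mom_logs named YYYYMMDD, and that PBS_HOME is /var/spool/torque when
unset; both are assumptions to adjust for your install, and mom_log_lines is a
hypothetical helper name.

```shell
#!/bin/sh
# mom_log_lines is a hypothetical helper, not part of TORQUE.
# Assumes one pbs_mom log file per day under $PBS_HOME/mom_logs, named
# YYYYMMDD, and falls back to /var/spool/torque when PBS_HOME is unset.
mom_log_lines() {
    jobid=$1
    # Optional second argument overrides the log file to search.
    logfile=${2:-${PBS_HOME:-/var/spool/torque}/mom_logs/$(date +%Y%m%d)}
    # -F treats the job id as a fixed string, so the dots in it are literal.
    grep -F -- "$jobid" "$logfile"
}
```

Run it on the node itself, for example 'mom_log_lines 1234.headnode' (made-up
job id), and look through the matching lines for errors about contacting the
server.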
