[torqueusers] Jobs stuck in Queue
Garrick Staples
garrick at usc.edu
Thu Oct 4 17:02:11 MDT 2007
On Thu, Oct 04, 2007 at 12:51:07PM -0700, Joshua Bernstein alleged:
> Hello All,
>
> I'm having a problem running MPI-based jobs linked against MPICH under
> TORQUE.
>
> The problem is this: in my job script, I try to start an MPI job in the
> same way I would outside TORQUE:
>
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> exec ./mpijob
> ---
exec? Why replace the top-level shell process?
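
For illustration, here is a minimal sketch of the same job script without exec. With exec, the shell is replaced by ./mpijob, so nothing after that line can ever run. Running the program as an ordinary child keeps the top-level shell around to report the exit status. The helper name run_job is hypothetical, and this is not claimed to fix the qdel cleanup problem by itself.

```shell
#!/bin/sh
#PBS -j oe

# run_job is a hypothetical helper, not part of TORQUE.
# It runs the given command as a child of the job shell (no exec),
# then reports and returns the command's exit status.
run_job() {
    "$@"
    status=$?
    echo "job command exited with status $status"
    return "$status"
}

# <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE> would go here.

# Launch the MPI program only if it is present in the working directory.
if [ -x ./mpijob ]; then
    run_job ./mpijob
fi
```

Because the shell is still alive after ./mpijob returns, any cleanup or logging placed after the run_job call would also execute, which is not possible after exec.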
> This of course correctly starts the job on the nodes, but if I do a
> qdel to kill the job, the job leaves the TORQUE queue while the
> processes stay on the nodes. This behavior has led me to use mpiexec.
At least the processes on the MS (mother superior) node are killed, right?
> So, if I use mpiexec a la:
>
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> mpiexec -comm none ./mpijob
> ---
-comm none? That's only for non-MPI programs.
> The job again starts properly on the nodes (albeit a bit slower), and
> when I do a qdel, the processes get properly cleaned off the nodes.
> The trouble here is that the job still shows up in the TORQUE queue
> marked as running. The only way to clean up this job is to remove its
> entries from $PBS_HOME/server_priv/jobs and from $PBS_HOME/mom_priv/jobs.
First, manually deleting files is bad. If you really must purge a job, use
'momctl -c' to clear it from the node and 'qdel -p' to clear it from the
server. That said, never use those commands!
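
As a last-resort sketch only, the two purge commands could be wrapped so that they are printed rather than run by default. The function name purge_job, the DRY_RUN variable, and the example job ID and host name are hypothetical. Passing the node with 'momctl -h' is an assumption about where the stale job lives.

```shell
#!/bin/sh
# purge_job is a hypothetical helper, not part of TORQUE.
# Usage: purge_job <jobid> [mom-host]
# By default it only prints the commands. Set DRY_RUN=0 to really run them.
purge_job() {
    jobid=$1
    host=${2:-localhost}
    if [ -z "$jobid" ]; then
        echo "usage: purge_job <jobid> [mom-host]" >&2
        return 2
    fi
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Clear the stale job from pbs_mom on the given node.
        momctl -h "$host" -c "$jobid"
        # Forcibly purge the job from pbs_server.
        qdel -p "$jobid"
    else
        echo "momctl -h $host -c $jobid"
        echo "qdel -p $jobid"
    fi
}
```

For example, 'purge_job 123.head node01' prints the two commands for review, and nothing touches the server or the node unless DRY_RUN=0 is set explicitly.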
If you look in pbs_mom's log file, you'll probably find an error message
related to not being able to talk to the server.
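
A minimal sketch of checking that log for one job follows. It assumes the usual TORQUE layout, where pbs_mom writes one file per day named YYYYMMDD under $PBS_HOME/mom_logs, and it falls back to /var/spool/torque if PBS_HOME is unset. The helper name mom_log_for_job and the fallback path are assumptions, so adjust them for your install.

```shell
#!/bin/sh
# mom_log_for_job is a hypothetical helper, not part of TORQUE.
# Usage: mom_log_for_job <jobid> [logdir]
# Prints the lines in today's pbs_mom log that mention the job ID.
mom_log_for_job() {
    jobid=$1
    logdir=${2:-${PBS_HOME:-/var/spool/torque}/mom_logs}
    logfile="$logdir/$(date +%Y%m%d)"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # -F treats the job ID as a fixed string, so its dots match literally.
    grep -F -- "$jobid" "$logfile"
}
```

Run it on the node whose pbs_mom handled the job, for example 'mom_log_for_job 123.head', and look for messages about failing to contact the server.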