[torqueusers] Jobs getting "stuck"
jkusznir at gmail.com
Mon Jan 9 11:52:09 MST 2012
I have an issue where some of my users are running jobs that get
"stuck". In this case, "stuck" means that the job ends, but torque
doesn't know. It shows the job as still running. Eventually, the
walltime runs out and torque tries to kill the job, but does not
remove it from the job list. It does send an e-mail to the owner
notifying them that the job has exceeded its walltime. A few minutes
later, it sends another e-mail and tries to kill the job again. This
continues until I purge the job with the qdel -p <jobid> command.
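
A rough way to spot such jobs before the repeated e-mails start is to compare used and requested walltime in the output of qstat -f. The Python sketch below is only an illustration: it assumes the output contains "Job Id:" headers followed by "job_state", "resources_used.walltime", and "Resource_List.walltime" attributes in HH:MM:SS form, and the function names and the sample text are hypothetical, not taken from a real cluster.

```python
import re


def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_jobs(qstat_text):
    """Split qstat -f style text into {job_id: {attribute: value}}."""
    jobs = {}
    current = None
    for line in qstat_text.splitlines():
        header = re.match(r"Job Id:\s*(\S+)", line)
        if header:
            current = jobs.setdefault(header.group(1), {})
            continue
        if current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
    return jobs


def overdue_jobs(qstat_text):
    """Return IDs of jobs still marked running past their requested walltime."""
    overdue = []
    for job_id, attrs in parse_jobs(qstat_text).items():
        used = attrs.get("resources_used.walltime")
        limit = attrs.get("Resource_List.walltime")
        if attrs.get("job_state") == "R" and used and limit:
            if to_seconds(used) > to_seconds(limit):
                overdue.append(job_id)
    return overdue


# Hypothetical sample text (assumed format, not real output).
SAMPLE = """\
Job Id: 101.example
    job_state = R
    resources_used.walltime = 02:15:00
    Resource_List.walltime = 02:00:00

Job Id: 102.example
    job_state = R
    resources_used.walltime = 00:30:00
    Resource_List.walltime = 02:00:00
"""

print(overdue_jobs(SAMPLE))
```

To run this against a live server, the sample text could be replaced with the captured output of qstat -f. The sketch only reports candidate jobs; it does not delete anything, so purging would still be done with qdel -p <jobid>.
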
One user seems to have it happen to the majority of his jobs; a few
others have had it happen to theirs. I haven't found a pattern yet;
some jobs are spawned through OpenMPI (which has torque integration);
others are non-MPI jobs (multi-threaded single-process or even just
single-process jobs). About 80% of the jobs do end correctly, but
there's still a rather large percentage that I have to purge by hand.

What causes this? What can I / my users do to fix this?