[torqueusers] jobs stuck in "W" state

David Golden dgolden at cp.dias.ie
Thu Feb 23 09:41:45 MST 2006

On 2006-02-23 16:01:16 +0000, gianfranco sciacca wrote:
> Hello torqueusers,
> I currently have, in one of our queues, a number of jobs that qstat
> has reported in the "W" state for a couple of days. On closer
> examination, qstat -f shows that each has an assigned execution node
> ("exec_host").

Coincidentally, something like that just happened on our cluster,
for a job that needed to stage in a file to a path that turned out
to be a dangling (invalid) symlink on an unhealthy node.
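
For anyone wanting to check a stage-in destination by hand, here is a
minimal Python sketch of a dangling-symlink test. The function name is
made up for illustration; it is not part of torque, and it only tests
the final path component on the node where it runs.

```python
import os


def is_dangling_symlink(path):
    """Return True if `path` is a symlink whose target does not exist."""
    # os.path.islink() inspects the link itself, while os.path.exists()
    # follows the link, so a broken link is "a link that does not exist".
    return os.path.islink(path) and not os.path.exists(path)
```

Running this on the suspect node against the stage-in destination
should show whether the path is a broken link.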

If file stage-in fails at job start, the job is postponed for
half an hour and an email is sent to the user, rather than the
job being removed from the queue outright. But once exec_host
is set, the job just keeps retrying the same node (at least in
torque-1.2.0p6). I seem to recall discussions of more flexible
behaviour a while back.
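
To spot jobs in this situation, something like the following Python
sketch could be fed the text output of qstat -f. It is not a torque
tool, and the function names are my own. It assumes the usual layout:
a "Job Id:" line per job, indented "name = value" attribute lines, and
long values wrapped onto continuation lines that start with a tab.
Check those assumptions against your own qstat output.

```python
def parse_qstat_full(text):
    """Parse `qstat -f` style text into {job_id: {attribute: value}}."""
    jobs = {}
    current = None   # attribute dict of the job being read
    last_key = None  # last attribute seen, for wrapped values
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            current = jobs.setdefault(job_id, {})
            last_key = None
        elif current is None or not line.strip():
            continue
        elif line.startswith("\t") and last_key is not None:
            # Assumed: a leading tab marks the continuation of a long value.
            current[last_key] += line.strip()
        elif " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def waiting_jobs_with_exec_host(text):
    """Return {job_id: exec_host} for jobs in state W that have exec_host set."""
    return {
        job_id: attrs["exec_host"]
        for job_id, attrs in parse_qstat_full(text).items()
        if attrs.get("job_state") == "W" and "exec_host" in attrs
    }
```

Any job this returns is in state "W" with a node already assigned, so
that node is the first place to look for a stage-in problem.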
