[torqueusers] jobs stuck in "W" state
David Golden
dgolden at cp.dias.ie
Thu Feb 23 10:17:43 MST 2006
On 2006-02-23 16:41:45 +0000, David Golden wrote:
> If file stage-in fails at job start, the job is postponed for
> half an hour and an email sent to the user
> rather than the job being totally removed from the queue.
> But once exec_host becomes set, it just keeps trying the
> same node again (at least for torque-1.2.0p6). ISTR discussions
> of more flexible behaviour a while back.
>
Self-replying with refs:
The stage-in behaviour was actually mentioned on-list a few days ago:
http://www.clusterresources.com/pipermail/torqueusers/2006-February/003202.html
Would be nice (tm) if there were an option to simply
have the job rejected when stage-in fails.
Related to the "sticky exec_host" thing:
http://www.supercluster.org/pipermail/torqueusers/2005-September/002130.html