[torqueusers] jobs stuck in "W" state

David Golden dgolden at cp.dias.ie
Thu Feb 23 10:17:43 MST 2006


On 2006-02-23 16:41:45 +0000, David Golden wrote:
> If file stage-in fails at job start, the job is postponed for
> half an hour and an email sent to the user
> rather than the job being totally removed from the queue.
> But once exec_host becomes set, it just keeps trying the
> same node again (at least for torque-1.2.0p6). ISTR discussions
> of more flexible behaviour a while back.
>

Self-replying with refs:

The stage-in behaviour was actually mentioned on-list a few days ago:
http://www.clusterresources.com/pipermail/torqueusers/2006-February/003202.html

Would be nice (tm) if there were an option to simply
have the job rejected if stage-in fails.

Related to the "sticky exec_host" thing:
http://www.supercluster.org/pipermail/torqueusers/2005-September/002130.html
