[torqueusers] Torque/maui node failure policy revisited again

Glen Beane glen.beane at gmail.com
Tue Dec 16 10:11:32 MST 2008


On Tue, Dec 16, 2008 at 11:24 AM, Craig West <cwest at astro.umass.edu> wrote:
>
> There is the case when you are launching multiple serial jobs from a single
> script. pbsdsh is useful for that.
>
> One problem comes to mind: what happens if the first node in the list
> (exec_host) is the node that goes MIA? Would this mean that torque would
> lose control and tracking of the running jobs?
> Perhaps if any node but the first node fails then the job should continue to
> run, and if the first node dies the job should be terminated?

Yes, even if we implemented this feature, the job would still be
terminated if the first node dies.  And like I said, I think this
should be a per-job option that defaults to the current behavior of
terminating the job if any node is lost.  It should also be possible
to set the default on a per-server or per-queue basis.
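
To make the proposal concrete, here is a rough sketch of the decision
logic in Python.  None of this exists in Torque: the policy names, the
function names, and the job > queue > server precedence are my own
assumptions for illustration, and exec_host is simplified to a plain
list of node names.

```python
from typing import Optional, Sequence

# Hypothetical policy values; these are not real Torque attributes.
TERMINATE = "terminate"  # current behavior: kill the job if any node is lost
CONTINUE = "continue"    # proposed: keep running if a non-first node is lost


def effective_policy(job: Optional[str] = None,
                     queue: Optional[str] = None,
                     server: Optional[str] = None) -> str:
    """Return the first explicitly set policy.

    Assumed precedence: job, then queue, then server. If nothing is set,
    fall back to the current behavior (terminate).
    """
    for setting in (job, queue, server):
        if setting is not None:
            return setting
    return TERMINATE


def should_terminate(exec_host: Sequence[str],
                     failed_node: str,
                     policy: str) -> bool:
    """Decide whether losing failed_node should end the job."""
    if failed_node not in exec_host:
        return False  # the lost node was not part of this job
    if failed_node == exec_host[0]:
        return True   # losing the first node always ends the job
    return policy == TERMINATE
```

With this sketch, a job that asks for "continue" survives the loss of
any node except the first, while a job with no setting anywhere behaves
exactly as it does today.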

