[torqueusers] Torque/maui node failure policy revisited again
glen.beane at gmail.com
Tue Dec 16 10:11:32 MST 2008
On Tue, Dec 16, 2008 at 11:24 AM, Craig West <cwest at astro.umass.edu> wrote:
> There is the case where you are launching multiple serial jobs from a single
> script; pbsdsh is useful for that.
> One problem comes to mind: what happens if the first node in the
> list (exec_host) is the node that goes MIA? Would this mean that Torque
> would lose control and tracking of the running jobs?
> Perhaps if any node but the first fails, the job should continue to
> run, and if the first node dies, the job should be terminated?
Yes. Even if we implemented this feature, the job would still be
terminated if the first node dies. As I said, I think this should be a
per-job option that defaults to the current behavior of terminating the
job if any node is lost. It should also be possible to set the
default on a per-server or per-queue basis.
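
As a rough illustration of the policy described above, here is a minimal
Python sketch. It is not Torque code: the names effective_policy,
job_survives, TERMINATE and CONTINUE are hypothetical, exec_host is
simplified to a plain list of node names, and the job-over-queue-over-server
precedence is an assumption about how the defaults would be resolved.

```python
from typing import List, Optional

# Hypothetical policy values; neither is an existing Torque attribute.
TERMINATE = "terminate"  # current behavior: kill the job if any node is lost
CONTINUE = "continue"    # proposed: keep running if a non-first node is lost


def effective_policy(job: Optional[str] = None,
                     queue: Optional[str] = None,
                     server: Optional[str] = None) -> str:
    """Pick the most specific setting that is present.

    Assumed precedence: per-job option, then queue default, then server
    default. With nothing set, fall back to the current behavior.
    """
    for setting in (job, queue, server):
        if setting is not None:
            return setting
    return TERMINATE


def job_survives(exec_host: List[str], failed_node: str, policy: str) -> bool:
    """Decide whether a job keeps running after failed_node is lost."""
    if failed_node not in exec_host:
        return True   # the lost node was not part of this job
    if failed_node == exec_host[0]:
        return False  # losing the first node always terminates the job
    return policy == CONTINUE
```

For example, with exec_host = ["node01", "node02"] and a policy of CONTINUE,
losing node02 leaves the job running, while losing node01 terminates it
regardless of the policy.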