[torqueusers] Torque behavior with failed nodes
garrick at usc.edu
Fri Jul 29 16:01:18 MDT 2005
I just tested this with torque p5 and something is broken: the mother
superior (MS) is actively killing jobs when a sister mom is down.
On Fri, Jul 29, 2005 at 09:38:34AM -0600, Dave Jackson alleged:
> The responsibility for handling job migration on node failure
> detection belongs with the scheduler. If using Moab, the parameter
> 'JOBACTIONONNODEFAILURE' can be set to a value such as requeue or cancel
> to inform the scheduler to requeue and restart the job, or just
> terminate it to free up the non-failed nodes which were allocated.
> Maui provides a warning about this issue but does not automatically
> take action (use 'mdiag -j'). pbs_sched doesn't seem to do anything.
> Let us know if there is further information we can provide.
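The requeue and cancel actions described above can be modeled roughly as follows. This is only a sketch of the policy, not Moab's implementation: the Job class, the handle_node_failure function, and the lowercase action names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    nodes: set              # names of the nodes allocated to this job
    state: str = "running"


def handle_node_failure(job, failed_node, action, free_nodes):
    """Apply a scheduler-side policy to one job after a node failure.

    action: "requeue", "cancel", or "ignore" (hypothetical names).
    free_nodes: set of node names available for new work, updated in place.
    """
    if action not in ("requeue", "cancel", "ignore"):
        raise ValueError(f"unknown action: {action}")
    if failed_node not in job.nodes:
        return job          # the job did not use the failed node
    if action == "ignore":
        return job          # warn-only behavior: allocation is left untouched
    # Requeue and cancel both release the surviving nodes. The failed node
    # is deliberately not returned to the free pool.
    free_nodes.update(job.nodes - {failed_node})
    job.nodes = set()
    job.state = "queued" if action == "requeue" else "cancelled"
    return job
```

With "requeue" the job goes back to the queue to be restarted elsewhere; with "cancel" it is simply terminated. In both cases the healthy nodes stop being tied up by a job that can no longer make progress.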
> On Thu, 2005-07-28 at 22:51 -0400, Pradeep Padala wrote:
> > Hi,
> > I am trying to understand Torque's behavior when a node fails. I am
> > checking the source, and I understand that check_nodes marks the node as
> > down by setting the node state to INUSE_DOWN, but I don't see any code
> > to move the jobs to somewhere else. What happens to the jobs running on
> > that node? Will the scheduler be told about the failed node?
> > Any input is greatly appreciated.
> > Thanks,
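The behavior Pradeep describes, where the node is marked down but its jobs are left alone, can be sketched like this. It is a simplified model, not the pbs_server source: the dictionary layout, the timeout parameter, and the string used for INUSE_DOWN are assumptions made for the example.

```python
INUSE_DOWN = "down"   # stand-in for the INUSE_DOWN state flag; value is illustrative


def check_nodes(nodes, now, timeout):
    """Mark nodes down when their last status update is older than `timeout`.

    nodes maps node name -> {"state": str, "last_update": float, "jobs": list}.
    Returns the names of the nodes newly marked down.
    """
    newly_down = []
    for name, node in nodes.items():
        stale = now - node["last_update"] > timeout
        if node["state"] != INUSE_DOWN and stale:
            # Only the state changes. node["jobs"] is not touched, so
            # anything further has to come from the scheduler.
            node["state"] = INUSE_DOWN
            newly_down.append(name)
    return newly_down
```

A scheduler that polls node state would see the change and could then decide what to do with the jobs still listed on that node.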
Garrick Staples, Linux/HPCC Administrator
University of Southern California