[torqueusers] Torque behavior with failed nodes

Dave Jackson jacksond at clusterresources.com
Fri Jul 29 09:38:34 MDT 2005


Pradeep,

  Handling running jobs when a node failure is detected is the
scheduler's responsibility.  If you are using Moab, the parameter
'JOBACTIONONNODEFAILURE' can be set to a value such as requeue or cancel
to tell the scheduler either to requeue and restart the job, or simply
to terminate it and free the non-failed nodes that were allocated to it.
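
  As a minimal sketch, assuming the parameter is set in moab.cfg in the
usual 'PARAMETER VALUE' form (please check the exact value spelling
against the documentation for your Moab version):

    # moab.cfg (illustrative fragment, not a complete configuration)
    # Requeue a job when one of its allocated nodes is detected as failed.
    JOBACTIONONNODEFAILURE  REQUEUE

  Using CANCEL in place of REQUEUE would terminate the job instead.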

  Maui warns about this condition (run 'mdiag -j' to see the warning)
but does not automatically take action.  pbs_sched does not appear to
do anything.

  Let us know if there is further information we can provide.

Dave

On Thu, 2005-07-28 at 22:51 -0400, Pradeep Padala wrote:
> Hi,
>     I am trying to understand Torque's behavior when a node fails. I am 
> checking the source, and I understand that check_nodes marks the node as 
> down by setting the node state to INUSE_DOWN, but I don't see any code 
> to move the jobs to somewhere else. What happens to the jobs running on 
> that node? Will the scheduler be told about the failed node?
> 
>     Any input is greatly appreciated.
> 
> Thanks,


