[torqueusers] Torque behavior with failed nodes

Dave Jackson jacksond at clusterresources.com
Fri Jul 29 11:00:32 MDT 2005


Pradeep,

  Are you killing the mother superior or a child node?  I believe there
are still issues with TORQUE releasing the job for re-execution on
another node when the mother superior is killed.  Contributing sites
have already improved TORQUE so that a job running on a failed mother
superior can be canceled.  I believe similar changes can be made to
allow such jobs to be requeued as well.
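
  For reference, the mother superior is the first host listed in the
job's exec_host attribute, so you can check which case you are hitting
(job id 123 and the node names below are just examples):

    $ qstat -f 123 | grep exec_host
        exec_host = node01/0+node02/0

Killing pbs_mom on node01 exercises the mother superior path; killing
it on node02 exercises the child (sister) node path.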

  We will check to see what the next steps are on this front.  Can you
verify that everything works if a child node fails?
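
  As a rough test of the child node case (the hostnames, job id, and
the way you stop the daemon are placeholders), something like this
should exercise it:

    $ echo "sleep 600" | qsub -l nodes=2
    123.server
    $ qstat -f 123 | grep exec_host    # second host is the child node
    $ ssh node02 killall pbs_mom
    $ pbsnodes -l                      # wait for node02 to show as down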

Dave

On Fri, 2005-07-29 at 12:43 -0400, Pradeep Padala wrote:
> Hi Dave,
>     Thanks for the reply. I basically want to do this.
> *) Submit a job
> *) Job is running on a particular node
> *) Kill the pbs_mom on that node
> *) Move the job to a new node
> *) Restart
> 
>     It doesn't seem like Torque+Maui can support this directly. So, I am 
> doing this.
> 
>     I have two machines in my cluster.
> *) Submit a job
> *) qhold jobid
> *) qrerun jobid
>     At this stage the job is put into a hold state
> *) I kill the pbs_mom on the first node and wait till PBS notices
>     that the node is down
> *) qalter -l neednodes= jobid
> *) qrls jobid
> 
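> For concreteness, here is the whole sequence as a shell transcript
> (job id 123 and the node name are just examples):
> 
>     $ echo "sleep 600" | qsub
>     123.server
>     $ qhold 123
>     $ qrerun 123
>     $ ssh node01 killall pbs_mom
>     $ pbsnodes -l        # wait until node01 is reported down
>     $ qalter -l neednodes= 123
>     $ qrls 123
> 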
>     Now it doesn't run, and checkjob -v reveals:
> 
> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, 
> rc: 15041, msg: 'Execution server rejected request MSG=send failed, 
> STARTING')
> Holds:    Defer  (hold reason:  RMFailure)
> 
>     It's clear that Torque is still trying to run it on the failed 
> node. What am I doing wrong? Is there a better way to do this?
> 
> Pradeep
> 
> Dave Jackson wrote:
> > Pradeep,
> > 
> >   The responsibility for handling job migration when a node failure
> > is detected belongs with the scheduler.  If you are using Moab, the
> > parameter 'JOBACTIONONNODEFAILURE' can be set to a value such as
> > REQUEUE or CANCEL, telling the scheduler either to requeue and
> > restart the job, or to terminate it so that the allocated nodes
> > which have not failed are freed up.
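> > 
> >   For example (the moab.cfg path varies by install; this is a
> > sketch, not a drop-in config), the requeue behavior is a one-line
> > setting in moab.cfg:
> > 
> >     JOBACTIONONNODEFAILURE REQUEUE
> > 
> > picked up once the scheduler is restarted.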
> > 
> >   Maui provides a warning about this issue but does not automatically
> > take action (use 'mdiag -j').  pbs_sched doesn't seem to do anything.  
> > 
> >   Let us know if there is further information we can provide.
> > 
> > Dave
> > 
> > On Thu, 2005-07-28 at 22:51 -0400, Pradeep Padala wrote:
> > 
> >>Hi,
> >>    I am trying to understand Torque's behavior when a node fails. I am 
> >>checking the source, and I understand that check_nodes marks the node as 
> >>down by setting the node state to INUSE_DOWN, but I don't see any code 
> >>that moves the jobs elsewhere. What happens to the jobs running on 
> >>that node? Will the scheduler be told about the failed node?
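> >>
> >>    (For what it's worth, the node state change is visible from the 
> >>outside with pbsnodes; the node name below is just an example:
> >>
> >>        $ pbsnodes -l
> >>        node01        down
> >>
> >>but I see nothing that moves or requeues the jobs on that node.)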
> >>
> >>    Any input is greatly appreciated.
> >>
> >>Thanks,


