[torqueusers] Torque behavior with failed nodes

Dave Jackson jacksond at clusterresources.com
Fri Jul 29 16:10:45 MDT 2005


Garrick,

  We will need to compare notes.  We have been testing this for over a
week and have seen no instance of MS 'fratricide'.  (Would this be
matricide?)

  We will look at this further and try to reproduce the failure.  Can you
send us level 7 logs from both the MS and the sister mom?

Thanks,
Dave

On Fri, 2005-07-29 at 15:01 -0700, Garrick Staples wrote:
> I just tested this with torque p5 and something is broken.  The MS is
> actively killing jobs when a sister mom is down.
> 
> On Fri, Jul 29, 2005 at 09:38:34AM -0600, Dave Jackson alleged:
> > Pradeep,
> > 
> >   Responsibility for handling job migration when a node failure is
> > detected belongs to the scheduler.  If using Moab, the parameter
> > 'JOBACTIONONNODEFAILURE' can be set to a value such as requeue or cancel
> > to tell the scheduler either to requeue and restart the job, or to
> > terminate it and free the non-failed nodes that were allocated to it.
> > 
> >   Maui reports a warning about this issue (use 'mdiag -j' to see it) but
> > does not automatically take action.  pbs_sched does not appear to do
> > anything.
> > 
> >   Let us know if there is further information we can provide.
> > 
> > Dave
> > 
> > On Thu, 2005-07-28 at 22:51 -0400, Pradeep Padala wrote:
> > > Hi,
> > > >     I am trying to understand Torque's behavior when a node fails. From
> > > > reading the source, I understand that check_nodes marks the node as
> > > > down by setting the node state to INUSE_DOWN, but I don't see any code
> > > > that moves the jobs elsewhere. What happens to the jobs running on
> > > > that node? Will the scheduler be told about the failed node?
> > > 
> > >     Any input is greatly appreciated.
> > > 
> > > Thanks,
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> 
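
The division of labor described in this thread (the server marks a node
down; the scheduler decides what happens to the jobs that were using it)
can be sketched roughly as below. This is an illustration only, not
Torque, Maui, or Moab code: the Action enum, jobs_on_failed_nodes,
plan_actions, and the job and node names are all hypothetical, and the
requeue/cancel values simply mirror the two options named in the thread.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical policy values, mirroring the options named in the thread."""
    REQUEUE = "requeue"  # put the job back in the queue to be restarted
    CANCEL = "cancel"    # terminate the job, freeing its surviving nodes


def jobs_on_failed_nodes(jobs, down_nodes):
    """Return the sorted ids of jobs with at least one allocated node down.

    jobs maps a job id to the list of node names allocated to that job.
    """
    down = set(down_nodes)
    return sorted(job_id for job_id, nodes in jobs.items() if down & set(nodes))


def plan_actions(jobs, down_nodes, policy):
    """Map every affected job id to the action the configured policy prescribes."""
    return {job_id: policy for job_id in jobs_on_failed_nodes(jobs, down_nodes)}


# Example: node02 fails; jobs 101 and 103 span it, job 102 does not.
jobs = {
    "101": ["node01", "node02"],
    "102": ["node03"],
    "103": ["node02", "node04"],
}
plan = plan_actions(jobs, ["node02"], Action.REQUEUE)
```

The point of the sketch is that the decision is a scheduler policy applied
to whole jobs, not to individual nodes: a multi-node job is affected as
soon as any one of its nodes is down.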
