[torqueusers] Torque behavior with failed nodes

Pradeep Padala ppadala at eecs.umich.edu
Fri Jul 29 11:06:30 MDT 2005


>   Are you killing the mother superior or a child node?  I believe there
> are still issues if you kill the mother superior with TORQUE releasing
> the job for re-execution on another node.  Contributing sites have
> already improved TORQUE to allow a job running on a failed mother
> superior to be canceled.  I believe similar changes can be made to allow
> jobs running on a failed mother superior to be requeued.
> 
>   We will check to see what the next steps are on this front.  Can you
> verify that everything works if a child node fails?

I am only killing a child node, by killing the pbs_mom running on that
node. I just tried this, and the job is re-scheduled to the free nodes.
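
To make sure the server has actually noticed the dead mom, the check
looks roughly like this (the exact kill command and PID handling are
just placeholders):

    # on the child node:
    kill <pbs_mom pid>

    # on the pbs_server host, repeat until the node is listed as down:
    pbsnodes -l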

*) Submit a job
*) qhold jobid
*) qrerun jobid
    At this stage the job is put into a hold state
*) I kill the pbs_mom on the first node and wait till PBS notices
     that the node is down
*) Instead of qalter, I did runjob -c jobid
*) qrls jobid
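
Put together, the sequence is roughly this (job id 123 and the job
script name are only placeholders):

    qsub job.sh
    qhold 123
    qrerun 123          # job is requeued and sits in the hold state
    # kill pbs_mom on the first node and wait until 'pbsnodes -l'
    # lists that node as down
    runjob -c 123       # Maui command, used here instead of qalter
    qrls 123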

I think that with qalter, only Torque clears the attributes, while Maui 
still has the hostlist set to the old values. Am I right?
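
(One way to check, again with job id 123 as a placeholder: clear the
node request on the Torque side and then look at what Maui still
reports for the job.)

    qalter -l neednodes= 123
    checkjob -v 123      # does Maui still show the old host list?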

-- 
Pradeep Padala
http://ppadala.blogspot.com

> On Fri, 2005-07-29 at 12:43 -0400, Pradeep Padala wrote:
> 
>>Hi Dave,
>>    Thanks for the reply. I basically want to do this.
>>*) Submit a job
>>*) Job is running on a particular node
>>*) Kill the pbs_mom on that node
>>*) Move the job to a new node
>>*) Restart
>>
>>    It doesn't seem like Torque+Maui can support this directly. So, I am 
>>doing this.
>>
>>    I have two machines in my cluster.
>>*) Submit a job
>>*) qhold jobid
>>*) qrerun jobid
>>    At this stage the job is put into a hold state
>>*) I kill the pbs_mom on the first node and wait till PBS notices
>>    that the node is down
>>*) qalter -l neednodes= jobid
>>*) qrls jobid
>>
>>    Now, it doesn't run, and checkjob -v reveals:
>>
>>job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, 
>>rc: 15041, msg: 'Execution server rejected request MSG=send failed, 
>>STARTING')
>>Holds:    Defer  (hold reason:  RMFailure)
>>
>>    It's clear that Torque is still trying to run it on the failed
>>node. What am I doing wrong? Is there a better way to do this?
>>
>>Pradeep
>>
>>Dave Jackson wrote:
>>
>>>Pradeep,
>>>
>>>  The responsibility for handling job migration on node failure
>>>detection belongs with the scheduler.  If using Moab, the parameter
>>>'JOBACTIONONNODEFAILURE' can be set to a value such as requeue or cancel
>>>to inform the scheduler to requeue and restart the job, or just
>>>terminate it to free up the non-failed nodes which were allocated.
>>>
>>>  Maui provides a warning about this issue but does not automatically
>>>take action (use 'mdiag -j').  pbs_sched doesn't seem to do anything.  
>>>
>>>  Let us know if there is further information we can provide.
>>>
>>>Dave
>>>
>>>On Thu, 2005-07-28 at 22:51 -0400, Pradeep Padala wrote:
>>>
>>>
>>>>Hi,
>>>>   I am trying to understand Torque's behavior when a node fails. I am 
>>>>checking the source, and I understand that check_nodes marks the node as 
>>>>down by setting the node state to INUSE_DOWN, but I don't see any code 
>>>>to move the jobs somewhere else. What happens to the jobs running on 
>>>>that node? Will the scheduler be told about the failed node?
>>>>
>>>>   Any input is greatly appreciated.
>>>>
>>>>Thanks,



