[torqueusers] Torque behavior with failed nodes

Pradeep Padala ppadala at eecs.umich.edu
Fri Jul 29 10:43:11 MDT 2005

Hi Dave,
    Thanks for the reply. I basically want to do this.
*) Submit a job
*) Job is running on a particular node
*) Kill the pbs_mom on that node
*) Move the job to a new node
*) Restart

    It doesn't seem like Torque+Maui can support this directly, so I am 
trying the following workaround.

    I have two machines in my cluster.
*) Submit a job
*) qhold jobid
*) qrerun jobid
    At this stage the job is put into a hold state
*) I kill the pbs_mom on the first node and wait till PBS notices
    that the node is down
*) qalter -l neednodes= jobid
*) qrls jobid
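
    The sequence above can be sketched as a small dry-run script. This is 
only an illustration of the steps as I listed them, not a tested recovery 
tool: the job id "123.headnode" and the helper names are hypothetical, the 
neednodes value is left empty exactly as in the qalter line above, and 
killing pbs_mom and waiting for PBS to notice still has to be done by hand 
between qrerun and qalter.

```python
import shlex
import subprocess


def recovery_commands(job_id, neednodes=""):
    """Return the hold/rerun/alter/release sequence as argv lists.

    job_id and neednodes are supplied by the caller; the empty default for
    neednodes mirrors the 'qalter -l neednodes= jobid' step in the post.
    """
    return [
        ["qhold", job_id],
        ["qrerun", job_id],
        # Manual step here: kill pbs_mom on the first node and wait until
        # PBS reports that node as down.
        ["qalter", "-l", "neednodes=" + neednodes, job_id],
        ["qrls", job_id],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical job id, used only for illustration.
    run(recovery_commands("123.headnode"))
```

    By default it only prints the commands, so it can be checked on a 
machine without Torque installed; passing dry_run=False actually runs them.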

    Now, it doesn't run and a checkjob -v reveals that

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, 
rc: 15041, msg: 'Execution server rejected request MSG=send failed, 
Holds:    Defer  (hold reason:  RMFailure)

    It's clear that Torque is still trying to run it on the failed 
node. What am I doing wrong? Is there a better way to do this?


Dave Jackson wrote:
> Pradeep,
>   The responsibility for handling job migration on node failure
> detection belongs with the scheduler.  If using Moab, the parameter
> 'JOBACTIONONNODEFAILURE' can be set to a value such as requeue or cancel
> to inform the scheduler to requeue and restart the job, or just
> terminate it to free up the non-failed nodes which were allocated.
>   Maui provides a warning about this issue but does not automatically
> take action (use 'mdiag -j').  pbs_sched doesn't seem to do anything.  
>   Let us know if there is further information we can provide.
> Dave
> On Thu, 2005-07-28 at 22:51 -0400, Pradeep Padala wrote:
>>    I am trying to understand Torque's behavior when a node fails. I am 
>>checking the source, and I understand that check_nodes marks the node as 
>>down by setting the node state to INUSE_DOWN, but I don't see any code 
>>to move the jobs to somewhere else. What happens to the jobs running on 
>>that node? Will the scheduler be told about the failed node?
>>    Any input is greatly appreciated.
