[torqueusers] Torque/maui node failure policy revisited again

Craig West cwest at astro.umass.edu
Tue Dec 16 09:24:49 MST 2008


There is the case where you are launching multiple serial jobs from a 
single script. pbsdsh is useful for that.
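
As a rough sketch of that use case: each task started by pbsdsh can look 
at the PBS_VNODENUM environment variable to decide which piece of serial 
work is its own. The function name pick_input and the file names below are 
made up for illustration, and the one-input-per-task mapping is an 
assumption, not something torque imposes.

```python
import os


def pick_input(inputs, env=None):
    """Return the input assigned to this task (hypothetical helper).

    pbsdsh exports PBS_VNODENUM to every task it spawns; here it is used
    as a plain index into the list of inputs. Falls back to 0 when the
    variable is missing, e.g. when run outside a job.
    """
    env = os.environ if env is None else env
    index = int(env.get("PBS_VNODENUM", "0"))
    return inputs[index]


if __name__ == "__main__":
    # Illustrative file names only.
    work = ["run_a.dat", "run_b.dat", "run_c.dat"]
    print(pick_input(work))
```

A script like this would be started once per allocated slot by a single 
pbsdsh call in the job script, so the list needs at least as many entries 
as there are slots.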

One problem comes to mind: what happens if the first node in 
the list (exec_host) is the node that goes MIA? Would that mean 
torque loses control and tracking of the running jobs?
Perhaps the job should continue to run if any node other than the 
first fails, and be terminated only if the first node dies?
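
To make the suggestion concrete, here is a small sketch of that decision 
rule. It is only the policy proposed above, not what torque currently 
does, and the function names are invented. It assumes exec_host looks like 
"node01/0+node01/1+node02/0", with the first entry naming the first node.

```python
def first_node(exec_host):
    """Return the host name of the first entry in an exec_host string."""
    first_slot = exec_host.split("+")[0]   # e.g. "node01/0"
    return first_slot.split("/")[0]        # e.g. "node01"


def should_terminate(exec_host, failed_node):
    """Proposed rule: kill the job only if the first node has failed."""
    return failed_node == first_node(exec_host)


if __name__ == "__main__":
    hosts = "node01/0+node01/1+node02/0"
    print(should_terminate(hosts, "node02"))  # sister node lost
    print(should_terminate(hosts, "node01"))  # first node lost
```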

Craig.

On 12/15/2008 04:54 PM, Chris Samuel wrote:
> ----- Charles at Schwieters.org wrote:
>   
>>   I saw a message with this subject from June, 2007 along with a
>> patch creating a fatal_job_poll_failure mom_priv/config option.
>> This option prevents the failure of a single node deleting an
>> entire job.
>>     
>
> Could you explain why you would not want the existing
> behaviour please ?
>
> For all codes that I'm aware of at present losing a
> single node of a parallel job means the parallel job
> has failed so I'd be really interested to hear where
> that's not necessarily the case.
>   

