[torqueusers] Torque/maui node failure policy

Marcus R. Epperson mrepper at sandia.gov
Mon Jun 18 19:21:44 MDT 2007


On 06/18/2007 06:30 PM, Garrick Staples wrote:
> On Mon, Jun 18, 2007 at 05:28:27PM -0700, Peter Wyckoff alleged:
>> Hi,
>>
>> I want to configure torque in such a way that if any node other than the
>> node running pbsdsh (the head node?) fails, do __NOTHING__  - don't cancel
>> the job or re-run it or anything.
>>
>> My code handles all failures other than the 1st node failing.
>>
>> Is there a way to configure torque to do nothing other than the head node?
>> Or do nothing no matter what ? (since head node failures should be rare as
>> opposed to other nodes).
> 
> TORQUE doesn't cancel jobs when sister nodes go down.  You might be
> seeing Maui do that, it has a 5 minute job delete hardwired in there.

Doesn't the MS (mother superior) kill the job if one or more IM polls fail?  We see these in our MS logs fairly often:

---------------------------------------------
pbs_mom;node_bailout, 242745.tbird-admin2 POLL failed from node dn637 7)
242745.tbird-admin2;kill_task: killing pid __ task __ with sig 9
...
pbs_mom;node_bailout, node_bailout: received KILL/ABORT request for job 242745.tbird-admin2 from node dn637
---------------------------------------------

And the user's .e file contains a 1099 error like this:
=>> PBS: job killed: node 7 (dn637) requested job terminate, 'EOF' (code 1099) - internal or network failure attempting to communicate with sister MOM's

I assumed you'd have to comment out the "send_sisters(pjob,IM_POLL_JOB)" call in the main MOM loop to avoid that.  Maybe I'm missing something, though.
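
If anyone wants to check how often this shows up on their own cluster, here is a rough sketch that pulls the job ID and node name out of "POLL failed" lines.  The pattern is derived only from the log excerpt above, so other TORQUE versions may format the message differently, and the function name find_poll_failures is just made up for illustration:

```python
import re

# Pattern derived only from the log excerpt in this message (an assumption);
# other TORQUE versions may word or lay out this line differently.
POLL_FAILED = re.compile(
    r"node_bailout, (?P<job>\S+) POLL failed from node (?P<node>\S+)"
)


def find_poll_failures(lines):
    """Return a (job_id, node_name) pair for each 'POLL failed' log line."""
    hits = []
    for line in lines:
        match = POLL_FAILED.search(line)
        if match:
            hits.append((match.group("job"), match.group("node")))
    return hits


if __name__ == "__main__":
    # Sample lines copied from the excerpt above.
    sample = [
        "pbs_mom;node_bailout, 242745.tbird-admin2 POLL failed from node dn637 7)",
        "242745.tbird-admin2;kill_task: killing pid __ task __ with sig 9",
    ]
    for job, node in find_poll_failures(sample):
        print(job, node)
```

To run it against a real log, pass an open file object instead of the sample list; where the MOM logs live depends on your install.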

-Marcus


