[torqueusers] Can i control if the jobs dies or not??

Garrick Staples garrick at usc.edu
Wed Aug 10 20:33:30 MDT 2005


On Wed, Aug 10, 2005 at 06:13:32PM -0700, Garrick Staples alleged:
> On Wed, Aug 10, 2005 at 08:23:24AM -0300, Leandro alleged:
> > behavior of PBS/Torque is kill the job when a node dies. Can I change this 
> > behavior? If there's no way to do that with some kind of configuration, can 
> > someone point me in the code where I can work on this?
> 
> Currently, the MOM on the execution node (Mother Superior, MS) will always
> kill the job if a sister MOM isn't replying.
> 
> MS sends IM_POLL_JOB messages to sisters.  When a sister isn't replying, MS
> closes the connection with mom_comm.c:im_eof() which calls
> mom_comm.c:node_bailout().  With outstanding IM_POLL_JOB messages,
> node_bailout() sets "pjob->ji_nodekill = np->hn_node;" and
> mom_main.c:job_over_limit() kills the job if "pjob->ji_nodekill !=
> TM_ERROR_NODE".

I haven't tested this yet, but removing the ji_nodekill assignment from
node_bailout() should do the trick:

--- src/resmom/mom_comm.c_orig   2005-07-26 23:24:55.000000000 -0700
+++ src/resmom/mom_comm.c        2005-08-10 19:25:45.000000000 -0700
@@ -1101,8 +1101,6 @@ void node_bailout(
 
         log_err(-1,id,log_buffer);
 
-        pjob->ji_nodekill = np->hn_node;
-
         break;
 
       case IM_GET_TID:
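
To make the effect of the patch easier to see, here is a minimal standalone
sketch of the logic described above. It is not TORQUE code: the sketch_*
names, the trimmed-down job struct, the "patched" flag, and the -1 sentinel
standing in for TM_ERROR_NODE are all assumptions made for illustration. Only
the ji_nodekill field name and the "!= TM_ERROR_NODE" check are taken from the
description of mom_comm.c and mom_main.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: sentinel meaning "no node has been blamed for a failure".
 * It stands in for TM_ERROR_NODE; the real value lives in the TORQUE headers. */
#define SKETCH_ERROR_NODE (-1)

/* Hypothetical, trimmed-down stand-in for the MOM's job structure. */
struct sketch_job {
    int ji_nodekill; /* id of the node blamed, or SKETCH_ERROR_NODE */
};

/* Start a job with no node blamed. */
void sketch_job_init(struct sketch_job *pjob)
{
    assert(pjob != NULL);
    pjob->ji_nodekill = SKETCH_ERROR_NODE;
}

/* Models node_bailout() for an outstanding IM_POLL_JOB request.
 * Unpatched: record the silent sister's node id in ji_nodekill.
 * Patched:   leave ji_nodekill alone, as the diff above does. */
void sketch_node_bailout(struct sketch_job *pjob, int hn_node, bool patched)
{
    assert(pjob != NULL);
    if (!patched) {
        pjob->ji_nodekill = hn_node;
    }
}

/* Models only the check in job_over_limit() discussed in this thread:
 * the job is killed once any node has been recorded in ji_nodekill. */
bool sketch_job_over_limit(const struct sketch_job *pjob)
{
    assert(pjob != NULL);
    return pjob->ji_nodekill != SKETCH_ERROR_NODE;
}
```

In this model, calling sketch_node_bailout() with patched set to false makes
sketch_job_over_limit() return true, while calling it with patched set to true
leaves the job alone. Whether the real MOM then copes with a job whose sister
never comes back is a separate question that this sketch does not answer.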


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California