[torqueusers] Can i control if the jobs dies or not??

Garrick Staples garrick at usc.edu
Thu Aug 11 09:55:29 MDT 2005


On Thu, Aug 11, 2005 at 08:22:34AM -0300, Leandro alleged:
> Thank you for the information. I will test it and reply with any news.
> 
> Is this patch for the latest snapshot?

I suspect that part of the code hasn't been changed in years, so I wouldn't
worry about the version.
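
For anyone following along, the mechanism described in the quoted message
below can be modelled with a small standalone C sketch. This is not the TORQUE
source: the two structs are stripped-down stand-ins that keep only the fields
named in the thread, the value given to TM_ERROR_NODE here is an assumption,
and keep_job_on_failure and sister_loss_kills_job() are hypothetical names
made up for illustration (there is no such option in TORQUE).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sentinel meaning "no sister node has failed"; the real
 * definition lives in the TORQUE headers. */
#define TM_ERROR_NODE (-1)

/* Stand-ins for the real structures, reduced to the fields in the thread. */
typedef struct { int hn_node; } hnodent;     /* one sister node */
typedef struct { int ji_nodekill; } job;     /* one running job */

/* Model of the IM_POLL_JOB case in node_bailout().
 * keep_job_on_failure is hypothetical: 0 models the stock code,
 * 1 models the code with the assignment removed by the patch. */
void node_bailout_poll(job *pjob, const hnodent *np, int keep_job_on_failure)
{
    assert(pjob != NULL && np != NULL);
    if (!keep_job_on_failure)
        pjob->ji_nodekill = np->hn_node;     /* the line the patch deletes */
}

/* Model of the check in job_over_limit(): nonzero means "kill the job". */
int job_over_limit(const job *pjob)
{
    return pjob->ji_nodekill != TM_ERROR_NODE;
}

/* Simulate one unanswered poll from sister node 3 and report whether
 * the job would then be killed. */
int sister_loss_kills_job(int keep_job_on_failure)
{
    job j = { TM_ERROR_NODE };               /* job starts with no failed node */
    hnodent sister = { 3 };

    node_bailout_poll(&j, &sister, keep_job_on_failure);
    return job_over_limit(&j);
}
```

In this model sister_loss_kills_job(0) returns 1 (stock behaviour, the job is
killed) and sister_loss_kills_job(1) returns 0 (ji_nodekill is never set, so
the check never fires). Like the patch itself, this is untested against a
real cluster.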


> 
> Regards,
> 
> -- 
> Leandro Tavares Carneiro
> Analista de Suporte Linux/Unix 
> 
> 2005/8/10, Garrick Staples <garrick at usc.edu>:
> > 
> > On Wed, Aug 10, 2005 at 06:13:32PM -0700, Garrick Staples alleged:
> > > On Wed, Aug 10, 2005 at 08:23:24AM -0300, Leandro alleged:
> > > > behavior of PBS/Torque is to kill the job when a node dies. Can I change
> > > > this behavior? If there's no way to do that with some kind of
> > > > configuration, can someone point me to where in the code I can work on
> > > > this?
> > >
> > > At this point in time, the MOM on the execution node (MS) will always kill
> > > the job if a sister MOM isn't replying.
> > >
> > > MS sends IM_POLL_JOB messages to sisters. When a sister isn't replying, MS
> > > closes the connection with mom_comm.c:im_eof() which calls
> > > mom_comm.c:node_bailout(). With outstanding IM_POLL_JOB messages,
> > > node_bailout() sets "pjob->ji_nodekill = np->hn_node;" and
> > > mom_main.c:job_over_limit() kills the job if "pjob->ji_nodekill !=
> > > TM_ERROR_NODE".
> > 
> > I haven't tried this yet, but this should do the trick:
> > 
> > --- src/resmom/mom_comm.c_orig 2005-07-26 23:24:55.000000000 -0700
> > +++ src/resmom/mom_comm.c 2005-08-10 19:25:45.000000000 -0700
> > @@ -1101,8 +1101,6 @@ void node_bailout(
> > 
> > log_err(-1,id,log_buffer);
> > 
> > - pjob->ji_nodekill = np->hn_node;
> > -
> > break;
> > 
> > case IM_GET_TID:
> > 
> > 
> > --
> > Garrick Staples, Linux/HPCC Administrator
> > University of Southern California
> > 
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California