[torquedev] Bug in post_epilogue()

Dave Jackson jacksond at clusterresources.com
Tue Aug 28 10:38:38 MDT 2007


Garrick,

  I don't see a fix in trunk.  In fact, the bug was first detected and
reported on a recent trunk-based distribution.  Is it possible that a
fix did not get committed?  Does the fix involve checking the mompost
return code and retrying from within scan_for_terminated()?
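
  To make the question concrete, here is a rough sketch of the kind of
check I have in mind.  It is not the actual mom code: the struct is a
stripped-down stand-in, run_mompost() and fake_post_epilogue() are
hypothetical names, and the int return type for ji_mompost (0 meaning
the obit was sent) is an assumption.  The only point is that the
pointer is cleared on success and left in place on failure, so the
next scan calls it again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the real job structure.
 * Only the field name ji_mompost comes from the thread; the int
 * return type (0 == success) is an assumption. */
struct job {
    int (*ji_mompost)(struct job *);  /* pending post-processing step */
    int obit_attempts;                /* illustration only */
};

/* Hypothetical helper: run the pending step and clear it only when
 * it succeeds.  Returns 0 if nothing was pending or the step
 * succeeded, nonzero if the step failed and must be retried. */
static int run_mompost(struct job *pjob)
{
    int rc;

    assert(pjob != NULL);

    if (pjob->ji_mompost == NULL)
        return 0;                  /* nothing pending */

    rc = pjob->ji_mompost(pjob);
    if (rc == 0)
        pjob->ji_mompost = NULL;   /* obit delivered, step is done */

    /* On failure ji_mompost is left set, so the next scan retries. */
    return rc;
}

/* Stand-in for post_epilogue(): fails on the first attempt and
 * succeeds on the second, mimicking a transient send failure. */
static int fake_post_epilogue(struct job *pjob)
{
    pjob->obit_attempts++;
    return (pjob->obit_attempts < 2) ? -1 : 0;
}
```

  With something like this, the scan loop would call run_mompost() for
each terminated job on every pass, and a job whose obit failed would
simply be picked up again on the next pass instead of being dropped.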

Dave

On Mon, 2007-08-27 at 20:40 -0700, Garrick Staples wrote:
> On Mon, Aug 27, 2007 at 09:36:46PM -0600, David B Jackson alleged:
> > post_epilogue() appears to have a problem: if the pbs_mom daemon fails
> > to send an obit message to the server on its first attempt, it does not
> > retry.  From the server's point of view, jobs then appear to hang for
> > an extended period of time and cannot be killed.  The routine appears
> > to borrow code from scan_for_exiting(), which is retried, but it lacks
> > the recall points required for the same approach to work.
> > 
> >   Basically, the question is: if post_epilogue()->client_to_svr() fails,
> > how does mom know to call post_epilogue() again?  scan_for_terminated()
> > executes pjob->ji_mompost, which pushes the obit, but it does not check
> > the routine's return code and sets pjob->ji_mompost to NULL in all
> > cases, preventing post_epilogue() from ever being run again.
> > 
> >   Are there suggestions for caching this request and making certain that
> > the obit makes it back to the server?
> 
> Isn't that fixed in trunk?


