[torquedev] Bug in post_epilogue()

Garrick Staples garrick at usc.edu
Mon Aug 27 21:40:20 MDT 2007


On Mon, Aug 27, 2007 at 09:36:46PM -0600, David B Jackson alleged:
> post_epilogue() appears to have an issue: if the pbs_mom daemon fails
> to send an obit message to the server on its first attempt, it does
> not retry. From the server's point of view, jobs then appear to hang
> for an extended period of time and cannot be killed. The routine
> appears to borrow code from scan_for_exiting(), which is retried, but
> it lacks the recall points required for the same approach to work.
> 
>   Basically, the question is: if post_epilogue()->client_to_svr() fails,
> how does mom know to call post_epilogue() again?  scan_for_terminated()
> executes pjob->ji_mompost, which pushes the obit, but it does not check
> the routine's return code and sets pjob->ji_mompost to NULL in all
> cases, preventing post_epilogue() from ever being run again.
> 
>   Are there suggestions for caching this request and making certain that
> the obit makes it back to the server?

Isn't that fixed in trunk?
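
For anyone following along, the failure mode described above can be
sketched roughly as below. This is a minimal stand-alone model, not the
actual TORQUE source: only the names ji_mompost, post_epilogue and
scan_for_terminated come from the message. The struct layout, the
send_failures_left and obit_sent fields, post_epilogue_stub() and both
scan_once_*() helpers are hypothetical and exist only for illustration.
The second helper shows one possible retry approach, not necessarily
what trunk does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the mom's job structure. */
struct job {
    int (*ji_mompost)(struct job *); /* pending post-processing callback */
    int send_failures_left;          /* illustration only: sends that will fail */
    int obit_sent;                   /* illustration only: 1 once the obit got out */
};

/* Stand-in for post_epilogue(): returns 0 on success and nonzero when
 * the (simulated) obit send to the server fails. */
int post_epilogue_stub(struct job *pjob)
{
    assert(pjob != NULL);
    if (pjob->send_failures_left > 0) {
        pjob->send_failures_left--;
        return 1;
    }
    pjob->obit_sent = 1;
    return 0;
}

/* Behaviour as described in the report: the return code is ignored and
 * the callback is cleared in all cases, so a failed send is never retried. */
void scan_once_as_described(struct job *pjob)
{
    assert(pjob != NULL);
    if (pjob->ji_mompost != NULL) {
        (void)pjob->ji_mompost(pjob);
        pjob->ji_mompost = NULL;
    }
}

/* One possible alternative (an assumption, not the trunk fix): clear the
 * callback only on success, so the next scan pass calls it again. */
void scan_once_with_retry(struct job *pjob)
{
    assert(pjob != NULL);
    if (pjob->ji_mompost != NULL && pjob->ji_mompost(pjob) == 0)
        pjob->ji_mompost = NULL;
}
```

With a job whose first send fails (send_failures_left = 1), repeated
passes of scan_once_as_described() never set obit_sent, while
scan_once_with_retry() delivers the obit on the second pass and only
then clears ji_mompost.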
