[torquedev] Bug in post_epilogue()
David B Jackson
jacksond at clusterresources.com
Mon Aug 27 21:36:46 MDT 2007
post_epilogue() appears to have a bug: if the pbs_mom daemon fails to
send the obit message to the server on its first attempt, it never
retries. From the server's point of view, the job then appears to hang
for an extended period of time and cannot be killed. The routine seems
to borrow code from scan_for_exiting(), which is retried, but
post_epilogue() lacks the recall points needed for the same approach to
work.
Basically, the question is: if post_epilogue()->client_to_svr() fails,
how does mom know to call post_epilogue() again? scan_for_terminated()
executes pjob->ji_mompost, which pushes the obit, but it does not check
the routine's return code and sets pjob->ji_mompost to NULL in all
cases, which prevents post_epilogue() from ever being run again.
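One possible direction, shown only as a rough sketch and not as the actual TORQUE code: have the caller clear ji_mompost only when the callback reports success, so a failed obit send stays pending and is retried on the next pass. Everything below is a hypothetical stand-in. The struct layout, the callback signature, and the names send_obit(), post_epilogue_sketch() and run_mompost() are assumptions made for illustration; only the field name ji_mompost comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified job record. Only the field name ji_mompost
 * is taken from the post; the rest exists to simulate send failures. */
struct job;
typedef int (*mompost_fn)(struct job *);

struct job {
    mompost_fn ji_mompost;   /* pending post-processing callback, or NULL */
    int send_failures_left;  /* simulation knob: sends that fail before one succeeds */
    int obit_sent;           /* set to 1 once the simulated server accepts the obit */
};

/* Stand-in for the client_to_svr() + obit path.
 * Returns 0 on success, -1 on failure. */
int send_obit(struct job *pjob)
{
    if (pjob->send_failures_left > 0) {
        pjob->send_failures_left--;
        return -1;
    }
    pjob->obit_sent = 1;
    return 0;
}

/* Sketch of a post_epilogue()-like routine that reports failure to its
 * caller instead of swallowing it. */
int post_epilogue_sketch(struct job *pjob)
{
    return send_obit(pjob);
}

/* Sketch of the relevant step in a scan_for_terminated()-like loop:
 * the callback is cleared only after it succeeds.
 * Returns 1 if the callback completed, 0 if nothing ran or it must be retried. */
int run_mompost(struct job *pjob)
{
    assert(pjob != NULL);

    if (pjob->ji_mompost == NULL)
        return 0;                 /* nothing pending */

    if (pjob->ji_mompost(pjob) != 0)
        return 0;                 /* failed: leave ji_mompost set for the next pass */

    pjob->ji_mompost = NULL;      /* succeeded: do not run it again */
    return 1;
}
```

With a job whose first two sends fail, calling run_mompost() three times leaves ji_mompost set after the first two calls and clears it after the third. A real fix would presumably also need a retry limit or backoff so that an unreachable server does not cause the callback to be retried forever.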
Are there suggestions for caching this request and making certain that
the obit makes it back to the server?