[torqueusers] Re: Jobs hanging in R state with torque 2.3.0
aripollak at gmail.com
Thu Apr 17 21:11:19 MDT 2008
Unfortunately, that didn't seem to fix the problem. I built a new mom
and server from the latest 2.3.1 snapshot with --enable-nochildsignal,
and jobs still hang, though slightly less often now.
On Thu, Apr 17, 2008 at 2:27 PM, Joshua Butikofer
<josh at clusterresources.com> wrote:
> Another possible solution to this problem is to configure TORQUE with the following ./configure
> flag: --enable-nochildsignal. We found with other customers that there is indeed a race condition
> having to do with children being harvested via a signal handler. This configure flag moves the
> "catch_child" functions out of that handler and into an oft run function. In our tests this does not
> hurt performance at all and DOES fix the race condition. It also avoids sending extra obits.
> Give it a try if you get the chance and let us know if you see your problem go away. If so, it may
> be worthwhile to make this flag the default for future versions of TORQUE.
> --Josh Butikofer
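
For readers unfamiliar with the pattern Josh describes, here is a minimal
sketch of the general idea: instead of harvesting children inside a SIGCHLD
handler, the main loop polls for exited children with a non-blocking wait.
This is an illustration in Python, not TORQUE's actual C code. The names
reap_children, run_demo, and the jobs table are hypothetical and only stand
in for the mom's bookkeeping.

```python
import os
import time


def reap_children(jobs):
    """Harvest exited children without blocking; meant to be called often
    from the main loop rather than from a SIGCHLD handler.

    `jobs` maps pid -> job name (hypothetical stand-in for a job table).
    Returns a list of (job name, exit code) for children that have exited.
    """
    finished = []
    while jobs:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        name = jobs.pop(pid, None)
        if name is None:
            continue  # a child we are not tracking
        # Report the exit code, or the negated signal number if killed.
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)
        finished.append((name, code))
    return finished


def run_demo():
    """Fork two short-lived children and reap them from a polling loop."""
    jobs = {}
    for name, code in (("job-a", 0), ("job-b", 3)):
        pid = os.fork()
        if pid == 0:
            os._exit(code)  # child: exit immediately with the given code
        jobs[pid] = name

    results = []
    deadline = time.monotonic() + 5.0
    while jobs and time.monotonic() < deadline:
        # The job table is only modified here, in normal control flow, so
        # a signal handler can never interrupt a half-finished update.
        results.extend(reap_children(jobs))
        time.sleep(0.01)
    return sorted(results)


if __name__ == "__main__":
    print(run_demo())
```

The trade-off is a small delay between a child exiting and the parent
noticing, bounded by how often the polling function runs. In exchange, all
bookkeeping happens in one place, which removes the kind of race a handler
can introduce.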