[torqueusers] Re: Jobs hanging in R state with torque 2.3.0 (workaround)

Ari Pollak aripollak at gmail.com
Thu Apr 17 21:11:19 MDT 2008


Unfortunately, that didn't fix the problem. I rebuilt both the mom and
the server from the latest 2.3.1 snapshot with --enable-nochildsignal,
and the hang still occurs, though slightly less often now.

On Thu, Apr 17, 2008 at 2:27 PM, Joshua Butikofer
<josh at clusterresources.com> wrote:
> Ari,
>
>  Another possible solution to this problem is to configure TORQUE with the following ./configure
>  flag: --enable-nochildsignal. We found with other customers that there is indeed a race condition
>  having to do with children being harvested via a signal handler. This configure flag moves the
>  "catch_child" functions out of that handler and into an oft run function. In our tests this does not
>  hurt performance at all and DOES fix the race condition. It also avoids sending extra obits.
>
>  Give it a try if you get the chance and let us know if you see your problem go away. If so, it may
>  be worthwhile to make this flag the default for future versions of TORQUE.
>
>  --Josh Butikofer
>
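
For anyone following along, here is a rough sketch of the general idea Josh
describes: instead of harvesting children inside a SIGCHLD handler, the main
loop polls for exited children with a non-blocking wait on every pass. This is
not TORQUE's code (TORQUE is written in C); it is a small Python illustration,
and the names spawn, reap_children and run_main_loop are made up for the
example.

```python
import os
import time


def spawn(exit_code):
    """Fork a child that exits immediately with exit_code (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid


def reap_children(running):
    """Harvest every child that has already exited, without blocking.

    Called from the main loop rather than from a signal handler, so it
    cannot interrupt other code that is touching the same job state.
    """
    finished = []
    while running:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if pid not in running:
            continue  # not one of ours; ignore it
        running.discard(pid)
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)  # killed by a signal
        finished.append((pid, code))
    return finished


def run_main_loop(exit_codes, poll_interval=0.01, timeout=5.0):
    """Start one child per exit code and poll until all are harvested."""
    running = {spawn(code) for code in exit_codes}
    results = {}
    deadline = time.monotonic() + timeout
    while running and time.monotonic() < deadline:
        for pid, code in reap_children(running):
            results[pid] = code
        time.sleep(poll_interval)  # stands in for the daemon's other work
    return results
```

Calling run_main_loop([0, 3, 7]) should return a dict mapping three child pids
to those exit codes. The trade-off is a small delay (at most one loop pass)
before an exit is noticed, in exchange for never running the harvest code
asynchronously.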

