[torqueusers] Re: Jobs hanging in R state with torque 2.3.0 (workaround)

Joshua Butikofer josh at clusterresources.com
Thu Apr 17 12:27:45 MDT 2008


Ari,

Another possible solution to this problem is to build TORQUE with the following ./configure
flag: --enable-nochildsignal. We found with other customers that there is indeed a race condition
in the way children are harvested via a signal handler. This configure flag moves the
"catch_child" functions out of that handler and into a frequently run function. In our tests this
does not hurt performance at all and DOES fix the race condition. It also avoids sending extra obits.
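
For anyone unfamiliar with the general technique, here is a minimal sketch of harvesting children
from a regularly run function instead of from a SIGCHLD handler. This is NOT the TORQUE code and
not necessarily how --enable-nochildsignal is implemented internally; reap_children() and
spawn_and_reap_demo() are hypothetical names made up for illustration, and the 1 ms poll interval
is an arbitrary assumption.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not a TORQUE function: reap every child that has
 * already exited, without blocking. Returns the number reaped. */
static int reap_children(void)
{
    int status;
    int reaped = 0;

    /* WNOHANG makes waitpid() return 0 at once if no child has exited yet,
     * and -1 if there are no children left at all. */
    while (waitpid(-1, &status, WNOHANG) > 0) {
        /* A real daemon would record the exit status here. */
        reaped++;
    }
    return reaped;
}

/* Demo driver: fork n short-lived children, then collect them by polling
 * from an ordinary loop. No SIGCHLD handler is installed, so nothing can
 * interrupt the bookkeeping halfway through. */
static int spawn_and_reap_demo(int n)
{
    const struct timespec pause = { 0, 1000000L }; /* 1 ms between polls */
    int started = 0;
    int total = 0;
    int polls;

    assert(n >= 0);
    signal(SIGCHLD, SIG_DFL); /* default disposition: exited children wait to be reaped */

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);        /* child: exit immediately */
        if (pid > 0)
            started++;       /* parent: count only successful forks */
    }

    /* Stand-in for the daemon's main loop; bounded so the demo cannot hang. */
    for (polls = 0; total < started && polls < 5000; polls++) {
        total += reap_children();
        if (total < started)
            nanosleep(&pause, NULL);
    }
    return total;
}
```

Because the reaping happens in normal program flow, it cannot run in the middle of another update
to the same job state, which is the kind of race a signal handler can introduce. On a POSIX system,
calling spawn_and_reap_demo(3) from main() should return 3 once all three children have exited.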

Give it a try if you get the chance and let us know if you see your problem go away. If so, it may
be worthwhile to make this flag the default for future versions of TORQUE.

--Josh Butikofer

Ari Pollak wrote:
> As a workaround, I've changed line 547 in src/resmom/catch_child.c to this:
> 
>     if (pjob->ji_qs.ji_substate != JOB_SUBSTATE_EXITING &&
>             pjob->ji_qs.ji_substate != JOB_SUBSTATE_OBIT)
> 
> So it will try sending the obit again, even if it thinks it's already
> being sent. This seems to eliminate the problem for me, and I'm not
> seeing any ill effects. I also found a comment in post_epilogue() that
> would indicate a proper retry is supposed to happen but was never
> implemented.

