[torqueusers] deadlock in torque p6
garrick at usc.edu
Thu Feb 3 18:30:01 MST 2005
On Thu, Feb 03, 2005 at 02:03:37PM +0100, Marcin Mogielnicki alleged:
> Hello everyone,
> Some nodes in my cluster have gotten into, hm, a deadlock mode.
> It happens when a node is in the busy state and suddenly goes down. The
> next time it starts, its load average is below the configured minimum
> load, so the state of the node is not updated. It won't be until local
> activity goes so high that the max load is exceeded. That's almost
> impossible for strictly computational nodes, so they sit idle while the
> server thinks they are busy. It lasts, and lasts, and lasts...
> The solution would be to update the state of the node every time mom is
> started. It can be done in a very simple way. The patch is given below.
Hrm, could have sworn we fixed this already. I need to look back through my
> And now my question: is this really a solution to the problem, or am I
> going the wrong way? I have a very strange feeling that some of the
> offline nodes went online on their own after I introduced this patch.
> It's difficult for me to check it now because all the nodes became busy
> after starting the patched mom.
I think you are noticing a different bug. Which I also thought we fixed
Garrick Staples, Linux/HPCC Administrator
University of Southern California