[torqueusers] deadlock in torque p6

Garrick Staples garrick at usc.edu
Thu Feb 3 18:30:01 MST 2005


On Thu, Feb 03, 2005 at 02:03:37PM +0100, Marcin Mogielnicki alleged:
> Hello everyone,
> 
> Some nodes in my cluster have ended up in, hm, deadlock mode.
> It happens when a node is in the busy state and suddenly goes down. The
> next time it starts, the load average is below the configured minimal
> load, so the state of the node is not updated. It won't be until local
> activity gets so high that max load is exceeded. That's almost impossible
> for strictly computational nodes, so they are idle, but the server thinks
> that they are busy. It lasts, and lasts, and lasts...
> 
> The solution would be to update the state of the node every time mom is 
> started. It can be done in a very simple way. The patch is given below.

Hrm, could have sworn we fixed this already.  I need to look back through my
patches.
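
To make the reported scenario concrete, here is a minimal sketch of the
behaviour as Marcin describes it. It is a toy model in Python, not TORQUE
source code: the Server and Mom classes, the min_load/max_load parameters,
the report_on_start flag and the node name are all hypothetical names chosen
for illustration, and the "fix" shown is only the idea from the quoted mail
(report state whenever mom starts), not the actual patch.

```python
# Illustrative model only; not TORQUE source code. All names are hypothetical.

class Server:
    """Remembers the last state each node reported."""

    def __init__(self):
        self.node_state = {}

    def update(self, node, state):
        self.node_state[node] = state


class Mom:
    """Reports to the server only when the load crosses a threshold."""

    def __init__(self, name, server, min_load, max_load, report_on_start=False):
        self.name = name
        self.server = server
        self.min_load = min_load
        self.max_load = max_load
        self.busy = False  # local state starts fresh on every (re)start
        if report_on_start:
            # Idea from the quoted mail: always push the state at startup.
            self.server.update(self.name, "free")

    def sample(self, load):
        if not self.busy and load > self.max_load:
            self.busy = True
            self.server.update(self.name, "busy")
        elif self.busy and load < self.min_load:
            self.busy = False
            self.server.update(self.name, "free")
        # Otherwise no threshold was crossed, so nothing is sent.


def simulate(report_on_start):
    server = Server()
    mom = Mom("node01", server, min_load=1.0, max_load=4.0)
    mom.sample(5.0)  # load exceeds max_load, server records "busy"
    # The node goes down and comes back: a new mom starts with busy=False.
    mom = Mom("node01", server, 1.0, 4.0, report_on_start=report_on_start)
    mom.sample(0.1)  # idle node: no threshold is crossed after the restart
    return server.node_state["node01"]
```

In this model, simulate(False) leaves the server believing the node is busy
even though it is idle, while simulate(True) clears the stale state as soon
as the restarted mom comes up.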

 
> And now my question - is this really a solution for the problem, or am I
> going the wrong way? I have a very strange feeling that some of the offline
> nodes went online on their own after I introduced this patch. It's difficult
> for me to check now because all the nodes became busy after starting the
> patched mom.

I think you are noticing a different bug.  Which I also thought we fixed
already.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California