[torqueusers] deadlock in torque p6

Marcin Mogielnicki mar_mog at o2.pl
Thu Feb 3 06:03:37 MST 2005

Hello everyone,

Some nodes in my cluster have ended up in, hm, a deadlock state.
It happens when a node is in the busy state and suddenly goes down. The
next time it starts, its load average is below the configured minimum
load, so the state of the node is not updated. It won't be updated until
local activity gets so high that the max load is exceeded. That is almost
impossible on strictly computational nodes, so they sit idle while the
server thinks they are busy. It lasts, and lasts, and lasts...

The solution would be to update the state of the node every time mom is 
started. It can be done in a very simple way. The patch is given below.
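
To make the stuck state concrete, here is a minimal, self-contained model in C. It is only a sketch and not the TORQUE source: the struct and function names, the SKETCH_* flag values and the thresholds 1.0 and 4.0 are all assumptions made up for illustration. It assumes mom reports to the server only when an "update" flag is set, and that the flag is set only when the load crosses one of the two thresholds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags; SKETCH_UPDATE_STATE stands in for UPDATE_MOM_STATE. */
#define SKETCH_UPDATE_STATE 0x01
#define SKETCH_BUSY         0x02

struct sketch_mom {
    double ideal_load;     /* at or below this, a busy node becomes free */
    double max_load;       /* at or above this, a free node becomes busy */
    int    internal_state; /* SKETCH_* flags */
};

struct sketch_server {
    bool node_busy;        /* what the server currently believes */
};

/* One polling step: re-evaluate the load, report only if flagged. */
void sketch_poll(struct sketch_mom *mom, struct sketch_server *srv, double load)
{
    assert(mom->ideal_load <= mom->max_load);
    bool busy = (mom->internal_state & SKETCH_BUSY) != 0;

    if (!busy && load >= mom->max_load) {
        mom->internal_state |= SKETCH_BUSY | SKETCH_UPDATE_STATE;
    } else if (busy && load <= mom->ideal_load) {
        mom->internal_state &= ~SKETCH_BUSY;
        mom->internal_state |= SKETCH_UPDATE_STATE;
    }

    if (mom->internal_state & SKETCH_UPDATE_STATE) {
        /* Send the current state to the server and clear the flag. */
        srv->node_busy = (mom->internal_state & SKETCH_BUSY) != 0;
        mom->internal_state &= ~SKETCH_UPDATE_STATE;
    }
}

/* The server still believes "busy" from before the crash; mom restarts
 * with the given initial flags and then polls an idle node ten times.
 * Returns what the server believes afterwards. */
bool sketch_server_sees_busy_after_restart(int initial_state)
{
    struct sketch_server srv = { true };
    struct sketch_mom mom = { 1.0, 4.0, initial_state };

    for (int i = 0; i < 10; i++)
        sketch_poll(&mom, &srv, 0.1);
    return srv.node_busy;
}
```

In this model, sketch_server_sees_busy_after_restart(0) leaves the server believing the node is busy, because no threshold is ever crossed. Starting with SKETCH_UPDATE_STATE set forces one report on the first poll and clears the stale state.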

And now my question: is this really a solution to the problem, or am I
going the wrong way? I have a very strange feeling that some offline
nodes went online on their own after I introduced this patch. It's
difficult for me to check right now because all the nodes became busy
after I started the patched mom.

        Marcin Mogielnicki, ICM, Poland

halo torque-1.1.0p6.tempmara # diff -u src/resmom/mom_main.c.orig src/resmom/mom_main.c
--- src/resmom/mom_main.c.orig 2005-02-02 21:10:06.647350573 +0100
+++ src/resmom/mom_main.c 2005-02-02 21:10:39.746373056 +0100
@@ -159,7 +159,7 @@
 unsigned int default_server_port;
 int exiting_tasks = 0;
 float ideal_load_val = -1.0;
-int internal_state = 0;
+int internal_state = UPDATE_MOM_STATE;

 int lockfds = -1;
 time_t loopcnt;  /* used for MD5 calc */
