[torqueusers] deadlock in torque p6
Marcin Mogielnicki
mar_mog at o2.pl
Thu Feb 3 06:03:37 MST 2005
Hello everyone,
Some nodes in my cluster have ended up in, hm, a kind of deadlock.
It happens when a node is in the busy state and suddenly goes down. The
next time it starts, its load average is below the given minimum load, so
the state of the node is not updated. It won't be updated until local
activity gets so high that the max load is exceeded. That's almost
impossible for strictly computational nodes, so they sit idle while the
server thinks that they are busy. It lasts, and lasts, and lasts...
The solution would be to update the state of the node every time mom is
started. This can be done in a very simple way. The patch is given below.
And now my question: is this really a solution to the problem, or am I
going the wrong way? I have a very strange feeling that some of the offline
nodes went online on their own after I introduced this patch. It's difficult
for me to check now because all the nodes became busy after I started the
patched mom.
Marcin Mogielnicki, ICM, Poland
halo torque-1.1.0p6.tempmara # diff -u src/resmom/mom_main.c.orig src/resmom/mom_main.c
--- src/resmom/mom_main.c.orig 2005-02-02 21:10:06.647350573 +0100
+++ src/resmom/mom_main.c 2005-02-02 21:10:39.746373056 +0100
@@ -159,7 +159,7 @@
unsigned int default_server_port;
int exiting_tasks = 0;
float ideal_load_val = -1.0;
- int internal_state = 0;
+ int internal_state= UPDATE_MOM_STATE;
int lockfds = -1;
time_t loopcnt; /* used for MD5 calc */