[torqueusers] Inconsistent behavior in updating trusted client list of MOMs
Hemanth Yamijala
yhemanth at yahoo-inc.com
Tue Oct 16 11:24:37 MDT 2007
Hi,
The torque wiki states that dynamically adding nodes must be
accompanied by a restart of pbs_server.
(Ref: http://www.clusterresources.com/wiki/doku.php?id=torque:3.1_adding_nodes)
One reason for this could be that the trusted client list on the
MOMs is not updated consistently after a node is added.
The test I performed was as follows:
My original setup had one host address in the pbs_server nodes file. The
MOM on this node had 3 addresses in its trusted client list: localhost,
its own address, and the pbs_server's address.
Then, using qmgr, I added one more node. This resulted in an update to
the MOM's trusted client list. After a while, I added another node.
This time, however, no update was sent.
Tracking this down in the code, I narrowed the problem to the function
send_cluster_addrs in server/node_manager.c. It appears that once an
update has been sent to all the nodes after a node is added, the
variable startcount should be reset to 0. Because it is static and is
never reset, it retains its value when the next node is added, so the
loop over the pbsndmast array starts from the stale index near the end
of the array and exits without going over the entire list again.
The following patch (on TRUNK) addresses this issue:
Index: src/server/node_manager.c
===================================================================
--- src/server/node_manager.c (revision 1572)
+++ src/server/node_manager.c (working copy)
@@ -1003,6 +1003,8 @@
delete_link(&nnew->nn_link);
}
+ /* reset startcount, as we've sent the updates for all servers */
+ startcount = 0;
}
} /* END send_cluster_addrs */
===================================================================
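The same hypothetical model with the patch's reset applied might look like the sketch below. It additionally assumes that a pass can be cut short and resumed later, which is one plausible reason for startcount being static; this is modeled with a made-up BATCH_LIMIT of 2 nodes per call and a loop standing in for rescheduling. The point it illustrates is where the reset sits: only after the whole table has been walked, at the end of the function as in the patch, so an interrupted pass still resumes where it stopped.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_NODES 8
#define BATCH_LIMIT 2   /* hypothetical: nodes contacted per call */

/* Hypothetical stand-in for the number of entries in pbsndmast. */
static int node_count = 0;
/* How many address-list updates each node's MOM has received. */
static int updates_received[MAX_NODES];

/* Simplified model of send_cluster_addrs WITH the proposed reset.
 * Returns 0 if the pass must be resumed later, 1 once every node
 * has been updated. */
static int send_cluster_addrs_model(void)
{
    static int startcount = 0;
    int sent = 0;

    for (; startcount < node_count; startcount++) {
        if (sent == BATCH_LIMIT)
            return 0;   /* stop early; resume from startcount next call */
        updates_received[startcount]++;   /* "send" the address list */
        sent++;
    }

    /* Mirror of the patch: the pass is complete, so the next node
     * addition starts again from the first node. */
    startcount = 0;
    return 1;
}

/* Add a node, then call the model until its pass completes, the way a
 * rescheduled work task would. */
static void add_node(void)
{
    assert(node_count < MAX_NODES);
    node_count++;
    while (!send_cluster_addrs_model())
        ;
}

/* One node from the nodes file, then three qmgr additions.
 * Returns the number of updates node 0 received. */
static int run_scenario(void)
{
    node_count = 1;
    add_node();
    add_node();
    add_node();
    printf("node 0 received %d update(s)\n", updates_received[0]);
    return updates_received[0];
}
```

With three additions, run_scenario() prints "node 0 received 3 update(s)": every addition now reaches every MOM, even when a pass is split across two calls.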
I am not fully familiar with the code. Can someone please verify whether my
analysis and fix are correct?
Thanks
Hemanth