Bugzilla – Bug 206
Nodes start with state FREE when starting pbs_server
Last modified: 2012-08-06 22:55:55 MDT
Using Torque 4.0.2 (and SVN 4.0-fixes), when pbs_server starts, all nodes that were not marked "offline" when pbs_server was shut down show up as "free".

We have a test setup where 3 of 5 nodes are online and running pbs_mom. One of the nodes is offline but powered (not running pbs_mom) and the last node is powered off. At startup of pbs_server all 5 nodes show up as free (and Moab detects 5 available nodes). Shortly after startup the 3 nodes that should be online get some information in their "status" variable. The other two nodes still report as free with no status information.

It takes about 150 seconds for the nodes that should be "down" to show up as down. I believe this is consistent with the pbs_server variable I have set:

    node_check_rate = 150

Given that the default is 600 seconds, it could take longer at other sites.

Note that if a node is marked "offline" before pbs_server is restarted, the node starts in the offline state.

We have noticed that server_priv/node_status differs from earlier versions of Torque. Previously this file contained a "1" if the node was offline. Now it appears to contain a "0" if the node is NOT offline, and offline nodes do not appear in the file at all. By NOT offline we mean that the node could be "down" or online ("free", "job-exclusive", etc).

We have seen a case (with Moab 7.0.2 and Torque 4.0.2) where, after starting pbs_server and then Moab, Moab attempted to schedule jobs to nodes that were not available (in the example above, both the node not running pbs_mom and the powered-off node were allocated jobs). These jobs failed to start, were put in a DEFER state by Moab and had the following error:

    RM failure, rc: 15046, msg: 'Resource temporarily unavailable'

The known workaround for our site is to wait the 150 seconds until the nodes are listed as down before starting (or resuming) scheduling. This cluster is a test cluster so it is not a major issue - yet.

Craig.
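For anyone scripting the workaround above, here is a minimal sketch of the timing check in C. It assumes, as the report suggests but does not confirm, that unreachable nodes are marked down within one node_check_rate interval of pbs_server starting. The function name is hypothetical and is not part of Torque or Moab.

```c
#include <time.h>

/* Hypothetical helper, not part of Torque or Moab.
 * Returns the number of seconds still to wait before one full
 * node_check_rate interval has passed since pbs_server started.
 * A return value of 0 means the interval has already elapsed. */
long seconds_until_nodes_settled(time_t server_start, time_t now,
                                 long node_check_rate)
{
    long elapsed = (long)difftime(now, server_start);

    if (elapsed < 0)
        elapsed = 0;                /* clock moved backwards: treat as just started */
    if (elapsed >= node_check_rate)
        return 0;                   /* interval elapsed, node states should be current */
    return node_check_rate - elapsed;
}
```

With node_check_rate = 150, a caller would delay starting (or resuming) scheduling until this returns 0. With the default of 600 the wait is correspondingly longer.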
After looking through the code I believe I have found some useful information.

The file affected is server/node_manager.c. The change appears to have happened around r4798 or r4799, but I can't be sure. For simplicity I will attach a patch for "my" fix, which has not been extensively tested.

    /*
    ** The only state that carries forward is if the
    ** node has been marked offline.
    */
    while ((np = next_host(&allnodes,&iter,NULL)) != NULL)
      {
      if (!(np->nd_state & INUSE_OFFLINE))
        {
        fprintf(nstatef, fmt, np->nd_name, np->nd_state & savemask);
        }

If I remove the "!" on the line with "np->nd_state" it seems to work, and the code then matches the comment above it.

As stated in my previous description of the problem, earlier versions of Torque put a "1" in the server_priv/node_state file for nodes that ARE offline. The current code puts a "0" if the node is NOT offline. If the node is not in the server_priv/node_state file then the problem doesn't appear.

Craig.
Created an attachment (id=117)
Makes nodes appear in server_priv if they are OFFLINE

Nodes will appear in server_priv if they are offline.
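The attachment itself is not reproduced here. The following is only a self-contained sketch of the behaviour the patch description aims for: a state line is written for a node only when its offline bit is set. The struct, the bit values, the "name state" line format and the function name are all assumptions made for illustration. The real definitions are in the Torque source, and the real loop uses next_host() as quoted in the comment above.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed bit values, for illustration only; the real ones are
 * defined in the Torque headers. */
#define INUSE_OFFLINE 0x01
#define INUSE_DOWN    0x02

/* Simplified stand-in for Torque's node structure. */
struct node {
    const char *nd_name;
    int         nd_state;
};

/* Write one "name state" line for each node that is marked offline.
 * Nodes that are free, busy or down are skipped, so they do not
 * carry any saved state across a restart.
 * Returns the number of lines written. */
int write_offline_nodes(FILE *out, const struct node *nodes, int count)
{
    const int savemask = INUSE_OFFLINE;  /* only the offline bit is persisted */
    int written = 0;
    int i;

    assert(out != NULL && nodes != NULL);

    for (i = 0; i < count; i++) {
        /* Note: no "!" here, unlike the condition quoted in the report. */
        if (nodes[i].nd_state & INUSE_OFFLINE) {
            fprintf(out, "%s %d\n", nodes[i].nd_name,
                    nodes[i].nd_state & savemask);
            written++;
        }
    }
    return written;
}
```

Under these assumptions an offline node is saved with the value 1, which matches the "1" that earlier Torque versions are reported to have written, and nodes that are merely down or free are left out of the file.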