Bug 206 - Nodes start with state FREE when starting pbs_server
Status: NEW
Product: TORQUE
Version: 4.0.*
Platform: PC Linux
Importance: P5 major
Assigned To: David Beer
Reported: 2012-07-16 21:01 MDT by Craig West
Modified: 2012-08-06 22:55 MDT (History)

Makes nodes appear in server_priv if they are OFFLINE (438 bytes, patch)
2012-08-06 22:55 MDT, Craig West

Description Craig West 2012-07-16 21:01:11 MDT
Using Torque 4.0.2 (and SVN 4.0-fixes), every node that was not marked
"offline" when pbs_server was shut down shows up as "free" when pbs_server
starts.

We have a test setup where 3 of 5 nodes are online and running pbs_mom. One of
the remaining nodes is powered on but not running pbs_mom, and the last node is
powered off.
At startup of pbs_server all 5 nodes show up as free (and Moab detects 5
available nodes). Shortly after startup the 3 nodes that should be online get
some information in their "status" variable. The other two nodes still report
as free with no status information.
It takes about 150 seconds for the nodes that should be "down" to show up as
down. I believe this is consistent with the pbs_server variable I have set:
  node_check_rate = 150
Given that the default is 600 seconds, it could take longer at other sites.

Note that if a node is marked "offline" before pbs_server is restarted, the
node does start in the offline state.

We have noticed that server_priv/node_status differs from earlier versions
of Torque. Previously this file contained a "1" if the node was offline. Now it
appears to contain a "0" if the node is NOT offline, and offline nodes do not
appear in the file at all. By NOT offline we mean that the node could be "down"
or online ("free", "job-exclusive", etc).

We have seen a case (with Moab 7.0.2 and Torque 4.0.2) where, after starting
pbs_server and then Moab, Moab attempted to schedule jobs to nodes that were
not available (in the example above, both the node not running pbs_mom and the
powered-off node were allocated jobs). These jobs failed to start, were put in
a DEFER state by Moab, and had the following error:
  RM failure, rc: 15046, msg: 'Resource temporarily unavailable'

The known workaround for our site is to wait the 150 seconds until the nodes
are listed as down before starting (or resuming) scheduling. This cluster is a
test cluster so it is not a major issue - yet.

Comment 1 Craig West 2012-08-06 22:52:19 MDT
After looking through the code I believe I have found some useful information.

The file affected is server/node_manager.c
The change appears to have been introduced around r4798 or r4799, but I can't
be sure. For simplicity I will attach a patch with "my" fix, which has not been
extensively tested.

  ** The only state that carries forward is if the
  ** node has been marked offline.

  while ((np = next_host(&allnodes,&iter,NULL)) != NULL)
    if (!(np->nd_state & INUSE_OFFLINE))
      fprintf(nstatef, fmt, np->nd_name, np->nd_state & savemask);

If I remove the "!" from the line that tests "np->nd_state", it seems to work,
and the code then matches the comment above it.

As stated in my previous description of the problem, the earlier versions of
Torque put a "1" in the server_priv/node_status file for nodes that ARE
offline. The current code puts a "0" if the node is NOT offline. If the node is
not in the server_priv/node_status file then the problem doesn't appear.
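
To make the proposed change concrete, here is a minimal standalone sketch of
the write loop with the "!" removed. It is not the actual Torque source: the
struct node_sketch type, the function name write_offline_nodes, the format
string, and the numeric values of the INUSE_* bits are all assumptions made so
the example compiles on its own.

```c
#include <stdio.h>

/* Illustrative bit values only; the real definitions are in Torque's headers. */
#define INUSE_OFFLINE 0x01
#define INUSE_DOWN    0x02

/* Hypothetical stand-in for the server's node structure. */
struct node_sketch
  {
  const char *nd_name;
  unsigned    nd_state;
  };

/*
 * Write one "name state" line for every node that IS marked offline.
 * Returns the number of lines written.
 */
int write_offline_nodes(FILE *out, const struct node_sketch *nodes, int count)
  {
  static const char *fmt = "%s %d\n";  /* assumed line format */
  unsigned savemask = INUSE_OFFLINE;   /* only the offline bit carries forward */
  int written = 0;
  int i;

  for (i = 0; i < count; i++)
    {
    /* No "!" here: nodes that are merely free or down are skipped. */
    if (nodes[i].nd_state & INUSE_OFFLINE)
      {
      fprintf(out, fmt, nodes[i].nd_name, (int)(nodes[i].nd_state & savemask));
      written++;
      }
    }

  return written;
  }
```

Calling write_offline_nodes(stdout, nodes, count) on a list that mixes free,
down, and offline nodes prints only the offline ones, each with a "1", which
matches the behaviour described for earlier versions of Torque.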

Comment 2 Craig West 2012-08-06 22:55:55 MDT
Created an attachment (id=117) [details]
Makes nodes appear in server_priv if they are OFFLINE

Nodes will appear in server_priv if they are offline.