[torquedev] [Bug 206] New: Nodes start with state FREE when starting pbs_server

bugzilla-daemon at supercluster.org
Mon Jul 16 21:01:11 MDT 2012


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=206

           Summary: Nodes start with state FREE when starting pbs_server
           Product: TORQUE
           Version: 4.0.*
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: cwest at vpac.org
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


Using TORQUE 4.0.2 (and SVN 4.0-fixes), when pbs_server is started, all nodes
that were not marked "offline" when pbs_server was shut down show up as "free".

We have a test setup where 3 of 5 nodes are online and running pbs_mom. One
node is powered on but not running pbs_mom, and the last node is powered off.
At startup of pbs_server all 5 nodes show up as free (and Moab detects 5
available nodes). Shortly after startup, the 3 nodes that really are online
populate their "status" attribute. The other two nodes still report as free
with no status information.
It takes about 150 seconds for the nodes that should be "down" to actually show
up as down. I believe this is consistent with the pbs_server setting we have:
  node_check_rate = 150
Given that the default is 600 seconds, it could take considerably longer at
other sites.
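
For reference, the current value can be checked and changed with qmgr (a
sketch using the standard TORQUE client tools; the 150-second value is just
our site's choice):

  qmgr -c 'print server' | grep node_check_rate   # show the current setting
  qmgr -c 'set server node_check_rate = 150'      # shorten the check interval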

It should be noted that if a node is marked "offline" before pbs_server is
restarted, that node does start in the offline state.
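
For anyone reproducing this, the offline flag is set and cleared with pbsnodes
(standard TORQUE commands; the node name is a placeholder):

  pbsnodes -o node04    # mark node04 offline before stopping pbs_server
  pbsnodes -c node04    # clear the offline flag again afterwards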

We have noticed that server_priv/node_status is different from earlier versions
of TORQUE. Previously this file contained a "1" if the node was offline. Now it
appears to contain a "0" if the node is NOT offline, and any offline nodes do
not appear in the file at all. By NOT offline we mean that the node could be
"down" or online ("free", "job-exclusive", etc.).

We have seen a case (with Moab 7.0.2 and TORQUE 4.0.2) where, when pbs_server
is started and Moab shortly after it, Moab attempts to schedule jobs on nodes
that are not actually available (in the example above, both the node not
running pbs_mom and the powered-off node were allocated jobs). These jobs
failed to start, were put in a DEFER state by Moab, and reported the following
error:
  RM failure, rc: 15046, msg: 'Resource temporarily unavailable'
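
When this happens, the deferred job and Moab's view of the nodes can be
inspected with the standard Moab client commands (the job ID is a placeholder):

  checkjob -v <jobid>   # shows why the job was deferred
  mdiag -n              # lists node states as Moab currently sees them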


The known workaround for our site is to wait the 150 seconds until the nodes
are listed as down before starting (or resuming) scheduling. This cluster is a
test cluster, so it is not a major issue - yet.
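
A rough sketch of that workaround as a restart sequence (assumes Moab is
already running and that mschedctl is available to pause and resume
scheduling; the sleep matches our node_check_rate of 150):

  mschedctl -s      # pause Moab scheduling
  qterm -t quick    # stop pbs_server
  pbs_server        # start pbs_server again
  sleep 160         # wait past node_check_rate for node states to settle
  pbsnodes -l       # confirm the unavailable nodes are now listed as down
  mschedctl -r      # resume scheduling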

Craig.
