[torquedev] Torque 4.0 at startup
dbeer at adaptivecomputing.com
Fri Jul 13 10:22:18 MDT 2012
That is most certainly a bug. We will see if we can reproduce it, but that
definitely shouldn't be happening. Nodes should be down until they report in.
On Fri, Jul 13, 2012 at 12:19 AM, Craig West <cwest at vpac.org> wrote:
> Hi All,
> This is an observation on Torque 4.0.2+ (not sure about other 4.X
> versions). I'm not sure if what I am seeing is a bug or a change in the
> way things are being monitored... I would like to hear from the developers
> about what the plan is/was, and what they were trying to achieve (if it
> is not just a bug).
> Setting up the scenario first: I have a test cluster with 5 nodes. Three of
> the nodes are online, one is powered on (but not running pbs_mom), and
> the last is not even powered on.
> It was a bit of a surprise to me that when I started pbs_server recently,
> all 5 nodes were reporting "free"... something that isn't even possible
> (only 3 should have been free). So, after a few different tests I
> figured out that the default startup state of nodes is now "free". While I
> don't recall what the original startup state was, I don't recall it being
> "free"... I think it might have been "down".
> It seems this is related to the status recorded in the
> server_priv/node_status file. If I take a node "offline" (as opposed
> to it just being "down" because pbs_mom is not running), the entry for
> that node in the node_status file vanishes... restarting pbs_server then
> has the node start in an offline state. All the nodes that are online
> have a "0" state.
> Previously, if a node was taken offline it would have a "1" in the
> node_status file (e.g. "nodename 1"). If a node was online then it didn't
> have an entry at all. I'm thinking that perhaps the change was made to
> benefit large-scale clusters.
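> To illustrate (hypothetical node names), the node_status file now looks
> something like:
>   node01 0
>   node02 0
>   node03 0
> with the offline node's entry simply missing, whereas previously an
> offline node would have appeared as "node04 1" and online nodes would
> have had no entries at all.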
> If a node is simply down there is no difference recorded in the
> node_status file... that is, at startup pbs_server assumes the node
> should be online and marks it as free. I even deleted the file, which
> caused the nodes to come online as they were detected, rather than all
> being assumed online.
> So, why do I bring all this up? I'm seeing a case here where MOAB is
> trying to launch jobs on the nodes that are "down"; the jobs fail and
> get deferred (straight after a pbs_server start - which includes
> a boot of the management node). It is possible these jobs could have
> started on a different online node (or stayed idle). It's not a big issue,
> as the jobs should start again fine after the defer period.
> However, it does mean that a job that was at the top of the Idle queue
> (highest priority) could be deferred while lower-priority jobs
> get to start. This will cause an issue for the users...
> Note: I get the following error on the job:
> RM failure, rc: 15046, msg: 'Resource temporarily unavailable'
> And then it is deferred (for an hour by default).
> If this is a bug, I'll submit a ticket.
> If the developers have changed something to improve things elsewhere
> (there has been a lot of work to improve the large scale side of Torque)
> then perhaps this has led to the issue.
> Perhaps there is something I can do to work around the issue?
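> (One thing I could presumably do is mark the unused nodes offline with
> something like "pbsnodes -o <nodename>" before restarting pbs_server, so
> that they would come back in the offline state rather than free.)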
> It does take a while before a node is finally detected as being down.
> Note: it takes about 150 seconds for a node to change from "free" to "down"
> after starting pbs_server. I have node_check_rate = 150 in the pbs_server
> settings, so I expect that is the trigger period. The default for pbs_server
> now appears to be 600 seconds.
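> For reference, assuming the usual qmgr syntax, that setting can be
> inspected and changed with something like:
>   qmgr -c "list server"
>   qmgr -c "set server node_check_rate = 150"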
> Craig West
> Systems Manager
> Victorian Partnership for Advanced Computing
> torquedev mailing list
> torquedev at supercluster.org
David Beer | Software Engineer