[torquedev] Torque 4.0 at startup
cwest at vpac.org
Fri Jul 13 00:19:25 MDT 2012
This is an observation on Torque 4.0.2+ (not sure about other 4.X
versions). I'm not sure if what I am seeing is a bug or a change in the
way things are being monitored... I'd like to hear from the developers
what the plan is/was, and what they were trying to achieve (if it is not
just a bug).
Setting up the scenario first. I have a test cluster with 5 nodes. 3 of
the nodes are online, one is powered on (but not running pbs_mom) and
the last is not even powered on.
It was a bit of a surprise to me that when I started pbs_server recently
all 5 nodes were reporting "free"... something that isn't even possible
(only 3 should have been free). So, after a few different tests, I
figured out that the default startup state of nodes is now "free". While
I don't recall exactly what the original startup state was, I'm fairly
sure it wasn't "free"... I think it was "down".
It seems this is related to the status recorded in the
server_priv/node_status file. If I take the node "offline" (as opposed
to it just being "down" because pbs_mom is not running), the entry for
the node in the node_status file vanishes... restarting pbs_server has
the node start with an offline state. All the nodes that are online have
a "0" state.
Previously, if a node was taken offline it would have a "1" in the
node_status file (e.g. "nodename 1"); if a node was online, it had no
entry at all. I'm guessing this change was made for the benefit of
large-scale clusters.
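If I've read the new behaviour right, the file now lists online nodes
with a "0" and drops offline nodes entirely. A small sketch of that
inferred format (node names invented, and this is only my reading of
what pbs_server writes, not documented behaviour):

```shell
# Reconstruct a sample node_status file in the (inferred) new format:
# online nodes get a "0" entry; offline nodes have no entry at all.
cat > node_status <<'EOF'
node1 0
node2 0
node3 0
EOF

# Nodes a restarted pbs_server would assume are online:
awk '$2 == 0 { print $1 }' node_status
# prints node1, node2, node3 (one per line)
```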
If a node is simply down, nothing different is recorded in the
node_status file... that is, at startup pbs_server assumes the node
should be online and marks it as free. I even deleted the file, which
caused the nodes to come online as they were detected, rather than all
being assumed online.
So, why do I bring all this up? I'm seeing a case here where MOAB tries
to launch jobs on the nodes that are "down"; the jobs fail and get
pushed to defer (straight after a pbs_server start, which includes a
boot of the management node). These jobs could have started on a
different online node (or stayed idle). Not a big issue in itself, as
the jobs start again fine after the defer period.
However, it does mean that a job that was at the top of the Idle queue
(highest priority) could be deferred and then the lower priority jobs
get to start. This will cause an issue with the users...
Note: I get the following error on the job:
RM failure, rc: 15046, msg: 'Resource temporarily unavailable'
And then it is deferred (for an hour by default).
If this is a bug, I'll submit a ticket.
If the developers have changed something to improve things elsewhere
(there has been a lot of work to improve the large scale side of Torque)
then perhaps this has led to the issue.
Perhaps there is something I can do to work around the issue?
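One workaround I can think of (untested, and it needs a running
pbs_server, so no output shown): explicitly mark the nodes I know are
down as offline before the scheduler gets to them, since an offline
state does survive a pbs_server restart. Node names here are just
examples from my test cluster:

```shell
# Mark known-down nodes offline so a restarted pbs_server does not
# report them "free" to the scheduler:
pbsnodes -o node4 node5

# Once the nodes are genuinely back up, clear the offline flag:
pbsnodes -c node4 node5
```

The downside is that it has to be done by hand (or scripted into the
management node's boot sequence), and forgetting the clear step leaves
good nodes out of the pool.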
It does take a while before the node is finally detected as being down.
Note: It takes about 150 seconds for a node to go from "free" to "down"
after starting pbs_server. I have node_check_rate = 150 in the
pbs_server settings, so I expect that is the trigger period. The default
now appears to be 600 seconds.
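For anyone wanting to shrink that window, node_check_rate is a
pbs_server attribute set via qmgr (again, this needs a live server, and
150 is just the value I happen to use):

```shell
# Seconds between pbs_server's node health checks; lowering it from the
# 600-second default makes down nodes show up as "down" sooner:
qmgr -c "set server node_check_rate = 150"

# Confirm the setting among the server attributes:
qmgr -c "list server"
```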
Craig West Systems Manager
Victorian Partnership for Advanced Computing