[torquedev] Torque 4.0 at startup

Craig West cwest at vpac.org
Fri Jul 13 00:19:25 MDT 2012


Hi All,

This is an observation on Torque 4.0.2+ (not sure about other 4.X 
versions). I'm not sure if what I am seeing is a bug, or a change in the 
way things are being monitored... would like to hear from the developers 
about what the plan is/was, and what they were trying to achieve (if it 
is not just a bug).

Setting up the scenario first: I have a test cluster with 5 nodes. 
Three of the nodes are online, one is powered on (but not running 
pbs_mom), and the last is not even powered on.

It was a bit of a surprise to me that when I started pbs_server 
recently, all 5 nodes were reporting "free"... something that isn't 
even possible (only 3 should have been free). So, after a few 
different tests, I figured out that the default startup state of 
nodes is now "free". While I don't recall exactly what the original 
startup state was, I don't recall it being "free"... I think it might 
have been "down".

It seems this is related to the status recorded in the 
server_priv/node_status file. If I take a node "offline" (as opposed 
to it just being "down" because pbs_mom is not running), the entry for 
that node in the node_status file vanishes... restarting pbs_server 
has the node start in an offline state. All of the nodes that are 
online have a "0" state.

Previously, if a node was taken offline it would have a "1" in the 
node_status file (e.g. "nodename 1"). If a node was online then it 
didn't have an entry at all (see the illustration below). I'm thinking 
that perhaps the change was made to benefit large-scale clusters.
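
To make the comparison concrete (the hostnames here are made up, and 
this is just my reading of what the file records), the old style with 
node3 offline and everything else online looked something like:

   node3 1

whereas the new style seems to list the online nodes with a "0" and 
drops the offline node entirely:

   node1 0
   node2 0
   node4 0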

If the node is simply down, there is no difference recorded in the 
node_status file... that is, at pbs_server startup it assumes the node 
should be online and marks it as free. I even deleted the file, which 
caused the nodes to come online as they were detected rather than all 
being assumed to be online.
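
For reference, the file-deletion test is roughly this sequence (the 
stop/start commands and the spool path are just the defaults I use, 
adjust for your install):

   qterm -t quick
   rm /var/spool/torque/server_priv/node_status
   pbs_server
   pbsnodes -a     # nodes show "down" until their pbs_mom reports in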


So, why do I bring all this up??? I'm seeing a case here where MOAB 
is trying to launch jobs on the nodes that are "down"; the jobs fail 
and get pushed into the deferred state (straight after a pbs_server 
start - which includes a boot of the management node). It is possible 
these jobs could have started on a different online node (or stayed 
idle). Not a big issue in itself, as the jobs should start again fine 
after the defer period.
However, it does mean that a job that was at the top of the Idle queue 
(highest priority) could be deferred while lower priority jobs get to 
start. This will cause issues for the users...

Note: I get the following error on the job:
RM failure, rc: 15046, msg: 'Resource temporarily unavailable'
And then it is deferred (for an hour by default).


If this is a bug, I'll submit a ticket.
If the developers have changed something to improve things elsewhere 
(there has been a lot of work on the large-scale side of Torque), then 
perhaps that change has led to this issue.
Perhaps there is something I can do to work around the issue (a rough 
idea is sketched below)?
It does take a while before the node is finally detected as being down.
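
The rough idea for a workaround (untested, and the hostname is made 
up) would be to explicitly mark the known-dead nodes offline as soon 
as pbs_server is up, since the offline state does survive a restart, 
and then clear it once pbs_mom is back:

   pbsnodes -o node5    # mark offline so jobs don't land there
   pbsnodes -c node5    # clear the offline flag once pbs_mom is back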

Note: it takes about 150 seconds for a node to show up as "down" 
(rather than "free") after starting pbs_server. I have 
node_check_rate = 150 in the pbs_server settings, so I expect that is 
the trigger period. The default for node_check_rate now appears to be 
600 seconds.
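
For reference, that setting can be checked and changed through qmgr 
(whether it is sensible to push it much lower on a big cluster is 
another question):

   qmgr -c 'print server' | grep node_check_rate
   qmgr -c 'set server node_check_rate = 150'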


Cheers,
Craig.

-- 
Craig West                   Systems Manager
Victorian Partnership for Advanced Computing

