[torqueusers] nodes file - basic install problems
gus at ldeo.columbia.edu
Fri Jul 6 09:30:07 MDT 2012
On 07/06/2012 12:58 AM, Rob Holmes wrote:
> I’m installing a small HPC cluster at work, which I’ve never done before
> and it’s causing me problems.
> My nodes file contains 14 compute nodes named node01, node02, etc. At
> the moment I just have four nodes switched on, with the remainder shown
> as ‘down’ with pbsnodes -a. When I submit a number of jobs, jobs are
> submitted to the first two nodes with the remaining two marked as
> ‘free’, regardless of how many jobs are waiting to be submitted. Jobs
> are kept in the queue until either node01 or node02 comes free, then
> are run. node03 and node12 (the other two live nodes) never run a job.
> However, when I remove node01 for example (by commenting out node01 in
> the nodes file and restarting pbs_server), jobs will run on node12.
> Bizarrely, node03 is then marked as ‘down’ in pbsnodes -a.
> This is long but basically I’m getting a lot of odd behavior and I’m not
> sure where to start debugging. All live nodes are running pbs_mom. The
> system was working as expected with just one compute node. With more
> than one it is having problems. I’m running pbs_sched. Can anyone please
> Rob Holmes
> Environmental Scientist – Catchments and Receiving Environments
> BMT WBM Pty Ltd
> Level 8, 200 Creek Street
> Brisbane QLD 4000 Australia
> P: +61 7 3831 6744
> W: www.bmtwbm.com.au
Maybe if you send more information,
it will ring a bell.
Here's a bunch of somewhat random possibilities.
1) It may help if you send the output of
qmgr -c 'p s',
the contents of your nodes file, and
the output of pbsnodes -a.
2) Did you set 'np=XX' on the various lines of the nodes file?
3) Any chance that the queue[s] or the server
is[are] configured with a maximum number of nodes or
4) Did you add any properties to the nodes, then
perhaps used them to restrict job access to some nodes or queues?
5) Is pbs_mom running on all the four nodes that are up?
6) Any funny stuff in /var/log/messages on the server and
perhaps on the compute nodes?
7) Likewise for $TORQUE/server_logs/YYYYMMDD [server],
or $TORQUE/mom_logs/YYYYMMDD [compute nodes] ?
8) Any chance that the node names cannot be resolved?
This is typically done in /etc/hosts on all nodes.
Normally each node name is associated with an IP address on a
private [to the cluster] subnet.
9) Are the nodes' [Ethernet?] interfaces up and
configured with the right IP addresses [as shown by ifconfig -a]?
10) Can you ping across every pair of nodes through the expected
route [ping -R ...]?
11) Any firewalls perhaps blocking the access to the nodes?
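Items 1, 8 and 10 above can be scripted roughly as follows. This is only a sketch, not a tested TORQUE tool: the function names (list_nodes, check_nodes, collect_info) are made up for illustration, the nodes file is assumed to live at $TORQUE/server_priv/nodes, and the ping options assume Linux iputils, where -W is a timeout in seconds.

```shell
#!/bin/sh
# Sketch only: helper names and the nodes file location are assumptions.

# Print the node names from a nodes file: the first field of every line,
# skipping blank lines and lines commented out with '#'.
list_nodes() {
    awk '!/^[[:space:]]*#/ && NF { print $1 }' "$1"
}

# For each node in the nodes file, report whether its name resolves
# (item 8) and whether a single ping gets a reply (item 10).
check_nodes() {
    for n in $(list_nodes "$1"); do
        if getent hosts "$n" >/dev/null 2>&1; then
            res=resolves
        else
            res=UNRESOLVED
        fi
        if ping -c 1 -W 2 "$n" >/dev/null 2>&1; then
            net=reachable
        else
            net=NO-REPLY
        fi
        printf '%s %s %s\n' "$n" "$res" "$net"
    done
}

# Gather the three pieces of information asked for in item 1.
# The TORQUE commands are skipped if they are not on the PATH.
collect_info() {
    echo "== qmgr -c 'p s' =="
    if command -v qmgr >/dev/null 2>&1; then qmgr -c 'p s'; fi
    echo "== nodes file: $1 =="
    cat "$1"
    echo "== pbsnodes -a =="
    if command -v pbsnodes >/dev/null 2>&1; then pbsnodes -a; fi
}
```

On the server, something like check_nodes "$TORQUE/server_priv/nodes" would print one line per node, and collect_info "$TORQUE/server_priv/nodes" > torque_info.txt would produce a file to attach to a reply. Running check_nodes on each compute node as well covers the "every pair of nodes" part of item 10. Nodes that are switched off will of course show NO-REPLY.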
I hope this helps,