[torqueusers] nodes file - basic install problems

Gus Correa gus at ldeo.columbia.edu
Fri Jul 6 09:30:07 MDT 2012

On 07/06/2012 12:58 AM, Rob Holmes wrote:
> Hi,
> I’m installing a small HPC cluster at work, which I’ve never done before
> and it’s causing me problems.
> My nodes file contains 14 compute nodes named node01, node02, etc. At
> the moment I just have four nodes switched on, with the remainder shown
> as ‘down’ with pbsnodes -a. When I submit a number of jobs, jobs are
> submitted to the first two nodes with the remaining two marked as
> ‘free’, regardless of how many jobs are waiting to be submitted. Jobs
> are kept in the queue until either of node01 or node02 come free, then
> are run. node03 and node12 (the other two live nodes) never run a job.
> However, when I remove node01 for example (by commenting out node01 in
> the nodes file and restarting pbs_server), jobs will run on node12.
> Bizarrely, node03 is then marked as ‘down’ in pbsnodes -a.
> This is long but basically I’m getting a lot of odd behavior and I’m not
> sure where to start debugging. All live nodes are running pbs_mom. The
> system was working as expected with just one compute node. With more
> than one it is having problems. I’m running pbs_sched. Can anyone please
> help?
> Cheers,
> Rob
> Rob Holmes
> Environmental Scientist – Catchments and Receiving Environments
> BMT WBM Pty Ltd
> Level 8, 200 Creek Street
> Brisbane QLD 4000 Australia
> P: +61 7 3831 6744
> W: www.bmtwbm.com.au
Hi Rob

Weird indeed.
Maybe if you send more information,
it will ring a bell.

Here's a bunch of somewhat random possibilities.

1) It may help if you send the output of

qmgr -c 'p s'

the contents of your nodes file, and the output of

pbsnodes -a

2) Did you set 'np=XX' on the various lines of the nodes file?

3) Any chance that the queue[s] or the server
is[are] configured with a maximum number of nodes or

4) Did you add any properties to the nodes, then
perhaps used them to restrict job access to some nodes or queues?

5) Is pbs_mom running on all the four nodes that are up?

6) Any funny stuff in /var/log/messages on the server and
perhaps on the compute nodes?

7) Likewise for $TORQUE/server_logs/YYYYMMDD [server],
$TORQUE/sched_logs/YYYYMMDD [server],
or $TORQUE/mom_logs/YYYYMMDD [compute nodes] ?

8) Any chance that the node names cannot be resolved?
This is typically done in /etc/hosts on all nodes.
Normally each node name is associated with an IP address on a
private [to the cluster] subnet.

9) Are the nodes' [Ethernet?] interfaces up and
configured with the right IP addresses [as shown by ifconfig -a]?

10) Can you ping across every pair of nodes through the expected
route [ping -R ...]?

11) Any firewalls perhaps blocking the access to the nodes?
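
In case it saves some typing, items 1, 8 and 10 could be gathered in
one go with something like the rough sketch below, run on the server.
It is only a sketch under assumptions: the nodes file path is a guess
(TORQUE's home is often /var/spool/torque, but yours may differ), the
function names node_names and check_nodes are made up for this example,
and it assumes a Linux box where getent and ping are available.

```shell
#!/bin/sh
# ASSUMPTION: location of the nodes file; override with NODES_FILE=... if
# your TORQUE home directory is somewhere else.
NODES_FILE="${NODES_FILE:-/var/spool/torque/server_priv/nodes}"

# Print the node name (first field) of every line in a nodes file,
# skipping blank lines and lines commented out with '#'.
node_names() {
    awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$1"
}

# For each node name, report whether it resolves and answers one ping.
check_nodes() {
    for n in $(node_names "$1"); do
        if getent hosts "$n" >/dev/null 2>&1; then
            resolved="resolves"
        else
            resolved="does NOT resolve"
        fi
        if ping -c 1 "$n" >/dev/null 2>&1; then
            reachable="answers ping"
        else
            reachable="does NOT answer ping"
        fi
        printf '%s: %s, %s\n' "$n" "$resolved" "$reachable"
    done
}

# Only do anything if the nodes file is actually readable here.
if [ -r "$NODES_FILE" ]; then
    echo "== qmgr -c 'p s' =="
    command -v qmgr >/dev/null 2>&1 && qmgr -c 'p s'
    echo "== pbsnodes -a =="
    command -v pbsnodes >/dev/null 2>&1 && pbsnodes -a
    echo "== $NODES_FILE =="
    cat "$NODES_FILE"
    echo "== name resolution and ping =="
    check_nodes "$NODES_FILE"
fi
```

Running the same name resolution and ping loop from one of the compute
nodes as well would tell you whether the problem is one-directional.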

I hope this helps,
Gus Correa
