[torqueusers] nodes file - basic install problems

Gus Correa gus at ldeo.columbia.edu
Fri Jul 6 09:30:07 MDT 2012


On 07/06/2012 12:58 AM, Rob Holmes wrote:
> Hi,
>
> I’m installing a small HPC cluster at work, which I’ve never done before
> and it’s causing me problems.
>
> My nodes file contains 14 compute nodes named node01, node02, etc. At
> the moment I just have four nodes switched on, with the remainder shown
> as ‘down’ with pbsnodes -a. When I submit a number of jobs, jobs are
> submitted to the first two nodes with the remaining two marked as
> ‘free’, regardless of how many jobs are waiting to be submitted. Jobs
> are kept in the queue until either node01 or node02 comes free, then
> are run. node03 and node12 (the other two live nodes) never run a job.
>
> However, when I remove node01 for example (by commenting out node01 in
> the nodes file and restarting pbs_server), jobs will run on node12.
> Bizarrely, node03 is then marked as ‘down’ in pbsnodes -a.
>
> This is long but basically I’m getting a lot of odd behavior and I’m not
> sure where to start debugging. All live nodes are running pbs_mom. The
> system was working as expected with just one compute node. With more
> than one it is having problems. I’m running pbs_sched. Can anyone please
> help?
>
> Cheers,
>
> Rob
>
> Rob Holmes
> Environmental Scientist – Catchments and Receiving Environments
>
> BMT WBM Pty Ltd
> Level 8, 200 Creek Street
> Brisbane QLD 4000 Australia
> P: +61 7 3831 6744
> W: www.bmtwbm.com.au
>
Hi Rob

Weird indeed.
Maybe if you send more information,
it will ring a bell.

Here's a bunch of somewhat random possibilities.

1) It may help if you send the output of

qmgr -c 'p s'

[short for 'print server'], the contents of your nodes file,
and the output of

pbsnodes -a

2) Did you set 'np=XX' on the various lines of the nodes file?

3) Any chance that the queue[s] or the server
is[are] configured with a maximum number of nodes or
jobs?

4) Did you add any properties to the nodes, and then
perhaps use them to restrict job access to some nodes or queues?

5) Is pbs_mom running on all four nodes that are up?

6) Any funny stuff in /var/log/messages on the server and
perhaps on the compute nodes?

7) Likewise for $TORQUE/server_logs/YYYYMMDD [server],
$TORQUE/sched_logs/YYYYMMDD [server],
or $TORQUE/mom_logs/YYYYMMDD [compute nodes] ?

8) Any chance that the node names cannot be resolved?
Name resolution is typically set up in /etc/hosts on all nodes.
Normally each node name is associated with an IP address on a
private [to the cluster] subnet.

9) Are the nodes' [Ethernet?] interfaces up and
configured with the right IP addresses [as shown by ifconfig -a]?

10) Can you ping across every pair of nodes through the expected
route [ping -R ...]?

11) Any firewalls perhaps blocking the access to the nodes?
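
In case it is useful, items 2, 8 and 10 can be checked in one pass with
a small script. What follows is only a rough sketch: the function names
are made up for this example [they are not TORQUE commands], it assumes
the nodes file has one node per line with '#' comment lines, and it
assumes getent and ping are available on the machine where it runs.

```shell
#!/bin/sh
# Sketch only: quick sanity checks driven by a TORQUE nodes file.
# Function names are hypothetical; they are not part of TORQUE.

# Print the node name (first field) of every non-blank, non-comment line.
node_names() {
    awk 'NF == 0 || $1 ~ /^#/ { next } { print $1 }' "$1"
}

# Item 2: print nodes whose line carries no np=<number> field.
nodes_without_np() {
    awk 'NF == 0 || $1 ~ /^#/ { next }
         {
             found = 0
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^np=[0-9]+$/) found = 1
             if (!found) print $1
         }' "$1"
}

# Item 8: print nodes whose name does not resolve on this host.
# Assumes getent is available (glibc-based Linux).
unresolved_nodes() {
    for n in $(node_names "$1"); do
        getent hosts "$n" >/dev/null 2>&1 || echo "$n"
    done
}

# Item 10: print nodes that do not answer a single ping from this host.
unreachable_nodes() {
    for n in $(node_names "$1"); do
        ping -c 1 "$n" >/dev/null 2>&1 || echo "$n"
    done
}

# Run all checks only when a nodes file is given on the command line.
if [ "$#" -ge 1 ]; then
    echo "no np= set:";    nodes_without_np "$1"
    echo "unresolved:";    unresolved_nodes "$1"
    echo "no ping reply:"; unreachable_nodes "$1"
fi
```

Saved under any name you like and run with the path to the nodes file
as its only argument, it prints the node names that fail each check.
Nodes that are switched off will of course show up as unreachable, so
only the four live nodes matter here. Running it on the server and
again on a compute node shows whether names resolve and pings succeed
in both directions.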

I hope this helps,
Gus Correa

