[torqueusers] nodes file - basic install problems

Rob Holmes Rob.Holmes at bmtwbm.com.au
Sun Jul 8 16:23:48 MDT 2012


Hi Gus,

Thanks for your reply, especially as my post wasn't particularly informative!

Checking the server logs showed me the way.  There were two kinds of messages for node03: a warning of 'no route to host' and an error of 'connection to node03 is bad'.  The problem was that node03 had its firewall up and SELinux enabled.  I'm not sure why TORQUE threw an error on some occasions but only a warning on others.  As a double whammy, node12 had SELinux enabled too.  I had switched all of these off but forgot to make the changes permanent, so they didn't survive a reboot.
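
In case it helps anyone else, here is a rough sketch of making those changes permanent. It assumes a RHEL/CentOS-style node with iptables (adjust for your distro):

    # stop the firewall now, and keep it off across reboots
    service iptables stop
    chkconfig iptables off

    # put SELinux in permissive mode immediately
    setenforce 0

    # to make SELinux stay off, set the following in /etc/selinux/config
    # (takes effect on the next reboot):
    #   SELINUX=disabled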

All is as it should be now and the system is working as expected.

Thanks for your help,
Rob



BMT WBM Pty Ltd
Level 8, 200 Creek Street
Brisbane QLD 4000 Australia
P: +61 7 3831 6744

W: www.bmtwbm.com.au


-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gus Correa
Sent: Saturday, 7 July 2012 01:30 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] nodes file - basic install problems

On 07/06/2012 12:58 AM, Rob Holmes wrote:
> Hi,
>
> I'm installing a small HPC cluster at work, which I've never done
> before and it's causing me problems.
>
> My nodes file contains 14 compute nodes named node01, node02, etc. At
> the moment I just have four nodes switched on, with the remainder
> shown as 'down' with pbsnodes -a. When I submit a number of jobs, jobs
> are submitted to the first two nodes with the remaining two marked as
> 'free', regardless of how many jobs are waiting to be submitted. Jobs
> are kept in the queue until either of node01 or node02 come free, then
> are run. node03 and node12 (the other two live nodes) never run a job.
>
> However, when I remove node01 for example (by commenting out node01 in
> the nodes file and restarting pbs_server), jobs will run on node12.
> Bizarrely, node03 is then marked as 'down' in pbsnodes -a.
>
> This is long but basically I'm getting a lot of odd behavior and I'm
> not sure where to start debugging. All live nodes are running pbs_mom.
> The system was working as expected with just one compute node. With
> more than one it is having problems. I'm running pbs_sched. Can anyone
> please help?
>
> Cheers,
>
> Rob
>
> Rob Holmes
> Environmental Scientist - Catchments and Receiving Environments
>
> BMT WBM Pty Ltd
> Level 8, 200 Creek Street
> Brisbane QLD 4000 Australia
> P: +61 7 3831 6744
> W: www.bmtwbm.com.au
>
Hi Rob

Weird indeed.
Maybe if you send more information, it will ring a bell.

Here's a bunch of somewhat random possibilities.

1) It may help if you send the output of

qmgr -c 'p s'

the contents of your nodes file, and the output of

pbsnodes -a

2) Did you set 'np=XX' on the various lines of the nodes file? [A sample layout is sketched after this list.]

3) Any chance that the queue[s] or the server is[are] configured with a maximum number of nodes or jobs?

4) Did you add any properties to the nodes, then perhaps used them to restrict job access to some nodes or queues?

5) Is pbs_mom running on all four of the nodes that are up?

6) Any funny stuff in /var/log/messages on the server and perhaps on the compute nodes?

7) Likewise for $TORQUE/server_logs/YYYYMMDD [server], $TORQUE/sched_logs/YYYYMMDD [server], or $TORQUE/mom_logs/YYYYMMDD [compute nodes] ?

8) Any chance that the node names cannot be resolved?
This is typically done in /etc/hosts on all nodes.
Normally each node name is associated with a private [to the cluster] subnet; an example is sketched after this list.

9) Are the nodes' [Ethernet?] interfaces up and configured with the right IP addresses [as shown by ifconfig -a]?

10) Can you ping across every pair of nodes through the expected route [ping -R ...]?

11) Any firewalls perhaps blocking access to the nodes? [A couple of quick checks are sketched after this list.]
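
Regarding item 2, the nodes file [$TORQUE/server_priv/nodes] might look something like this; the np values here are just an example, set them to the slot count you want on each node:

    # one line per host; np = number of job slots [usually the core count]
    node01 np=8
    node02 np=8
    node03 np=8
    # ... and so on through node14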
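
Regarding items 8-10, /etc/hosts on every node would normally carry the private addresses, something like this [the addresses here are made up]:

    192.168.1.1    headnode
    192.168.1.101  node01
    192.168.1.102  node02
    192.168.1.103  node03

You can then check resolution and reachability from the server with:

    getent hosts node03
    ping -c 3 node03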
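
Regarding item 11, a couple of quick checks [assuming Linux nodes with iptables; pbs_mom normally listens on TCP port 15002 and pbs_server on 15001, but check /etc/services if your build differs]:

    # on the node: list the active firewall rules
    # [no rules and ACCEPT policies means nothing is blocked]
    iptables -L -n

    # from the server: test that the MOM port on the node is reachable
    telnet node03 15002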

I hope this helps,
Gus Correa
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


