[torqueusers] PBS_NODEFILE incomplete (entries for last(?) node only)

Gus Correa gus at ldeo.columbia.edu
Mon Oct 11 11:10:09 MDT 2010


Gordon Wells wrote:
> Hi
> 
> The various /etc/hosts, nodes, server_name and config files seem to 
> be consistent. The nodes are indeed connected to the internet; could 
> that be problematic?

Hi Gordon

Yes, if the nodes are behind firewalls, or have iptables rules 
restricting the connections between them.
A firewall may prevent Torque and MPI from working.
Moreover, if you use the Internet-facing addresses,
the extra network traffic may hurt performance (MPI, I/O, etc.).
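A quick way to check is something like this (just a sketch; it assumes 
the default Torque ports, 15001 for pbs_server and 15002/15003 for 
pbs_mom, and "headnode"/"node01" are placeholders for your host names):

iptables -L -n              # on each node: any DROP/REJECT rules?
telnet headnode 15001       # from a node: can you reach pbs_server?
telnet node01 15002         # from the head node: can you reach the mom?

If the telnet connections hang or are refused while the daemons are 
running, a firewall is probably in the way.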

Here I (and most people) use a private subnet for this, say 192.168.1.0 
or 10.1.1.0, either one with netmask 255.255.255.0.
Sometimes two private subnets: one for cluster control and I/O,
another for MPI.
Typical server motherboards come with two onboard Ethernet ports,
but you can also plug Gigabit Ethernet NICs into available motherboard 
slots.
You could buy cat5e cables and a new switch for this, or, if your switch 
has VLAN capability and enough idle ports,
you can create a virtual subnet on it.

On each node you have to configure these new interfaces properly,
either through DHCP or statically (quite easy: put the IP
address and the netmask in
/etc/sysconfig/network-scripts/ifcfg-eth1, assuming eth1
is the private subnet interface ... oh well, this is for 
RHEL/CentOS/Fedora; it may be somewhat
different in Debian/Ubuntu or SLES).
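For instance, a minimal static ifcfg-eth1 might look like this
(just a sketch; the address below is the one used for node01 in the
example further down, so adjust it on each node):

DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.1.1
NETMASK=255.255.255.0
ONBOOT=yes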

Then insert names for these interfaces and the associated IPs in the 
/etc/hosts file (the same on all nodes).
For instance:

192.168.1.1 node01
...

The same names should also be used in the ${TORQUE}/server_priv/nodes file.
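
For instance, something like this (np=2 is just an example; set it to 
the number of cores each node actually has):

node01 np=2
node02 np=2
...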

In any case, either using the Internet or a private subnet,
you need to make sure the users can
ssh passwordless across all pairs of nodes.
Can you do this on all node pairs on your cluster?
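
A quick test, run as a regular user and using the private host names 
(node01 and node02 here are just the example names above):

ssh node01 'ssh node02 hostname'

If it prints node02 without asking for passwords or host key 
confirmations, that pair is OK.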

This can be done, for instance, by creating an ssh-rsa key pair
and putting copies of the public key in
/etc/ssh/ssh_known_hosts2 on all nodes,
something like this:

192.168.1.1,node01 ssh-rsa [the same ssh-rsa public key copy goes here]
192.168.1.2,node02 ssh-rsa [the same ssh-rsa public key copy goes here]
...

However, you *don't want to do this with public IP addresses*,
only with private ones.
(Yet another issue with using the Internet for Torque and MPI.)
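
If it helps, here is one way to generate such a file on the head node
(just a sketch: it assumes nodes named node01, node02, ... on
192.168.1.x, 16 of them, all sharing the same RSA host key, e.g.
because they were cloned from the same image):

KEY=$(cut -d' ' -f1,2 /etc/ssh/ssh_host_rsa_key.pub)
for i in $(seq 1 16); do
    printf '192.168.1.%d,node%02d %s\n' "$i" "$i" "$KEY"
done > ssh_known_hosts2.new
# then copy ssh_known_hosts2.new to /etc/ssh/ssh_known_hosts2 on all nodes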

I hope this helps,
Gus Correa



> 
> As for 5), won't that require $PBS_NODEFILE to be correctly generated?
> 
> Regards
> Gordon
> 
> -- max(∫(εὐδαιμονία)dt)
> 
> Dr Gordon Wells
> Bioinformatics and Computational Biology Unit
> Department of Biochemistry
> University of Pretoria
> 
> 
> On 8 October 2010 01:09, Gus Correa <gus at ldeo.columbia.edu> wrote:
> 
>     Hi Gordon
> 
>     Some guesses:
> 
>     1) Do you have mom daemons running on the nodes?
>     I.e. on the nodes, what is the output of "service pbs status" or
>     "service pbs_mom status"?
> 
>     2) Do your mom daemons on the nodes point to the server?
>     I.e. what is the content of $TORQUE/mom_priv/config?
>     Is it consistent with the server name in $TORQUE/server_name ?
> 
>     3) What is the content of your /etc/hosts file on the head node
>     and on each node?
>     Are they the same?
>     Are they consistent with your nodes file,
>     i.e. head_node:$TORQUE/server_priv/nodes (i.e. same host names
>     that have IP addresses listed in /etc/hosts)?
> 
>     4) Are you really using the Internet to connect the nodes,
>     as the fqdn names on your nodes file (sent in an old email) suggest?
>     (I can't find it, maybe you can post it again.)
>     Or are you using a private subnet?
> 
>     5) Did you try to run hostname via mpirun on all nodes?
>     I.e., something like this:
> 
>     ...
>     #PBS -l nodes=8:ppn=2
>     ...
>     mpirun -np 16 hostname
> 
> 
>     I hope this helps,
>     Gus Correa
> 
>     Gordon Wells wrote:
>      > I've tried that, unfortunately I never get a $PBS_NODEFILE that spans
>      > more than one node.
>      >
>      > -- max(∫(εὐδαιμονία)dt)
>      >
>      > Dr Gordon Wells
>      > Bioinformatics and Computational Biology Unit
>      > Department of Biochemistry
>      > University of Pretoria
>      >
>      >
>      > On 7 October 2010 10:02, Vaibhav Pol <vaibhavp at cdac.in> wrote:
>      >
>      >      Hi ,
>      >      you must set server as well as queue attribute.
>      >
>      >             set server resources_available.nodect = (number of
>      >             nodes * cpus per node)
>      >             set <queue name> resources_available.nodect = (number of
>      >             nodes * cpus per node)
>      >
>      >
>      >      Thanks and regards,
>      >      Vaibhav Pol
>      >      National PARAM Supercomputing Facility
>      >      Centre for Development of Advanced Computing
>      >      Ganeshkhind Road
>      >      Pune University Campus
>      >      PUNE-Maharastra
>      >      Phone +91-20-25704176 ext: 176
>      >      Cell Phone :  +919850466409
>      >
>      >
>      >
>      >     On Thu, 7 Oct 2010, Gordon Wells wrote:
>      >
>      >         Hi
>      >
>      >         I've now tried torque 2.5.2 as well, same problems.
>      >         Setting resources_available.nodect has no effect except
>      >         allowing me to use "-l nodes=x" with x > 14
>      >
>      >         regards
>      >
>      >         -- max(∫(εὐδαιμονία)dt)
>      >
>      >         Dr Gordon Wells
>      >         Bioinformatics and Computational Biology Unit
>      >         Department of Biochemistry
>      >         University of Pretoria
>      >
>      >
>      >         On 6 October 2010 20:04, Glen Beane <glen.beane at gmail.com> wrote:
>      >
>      >             On Wed, Oct 6, 2010 at 1:12 PM, Gordon Wells
>      >             <gordon.wells at gmail.com> wrote:
>      >
>      >                 Can I confirm that this will definitely fix the
>      >                 problem? Unfortunately this cluster also needs to
>      >                 be glite compatible, 2.3.6 seems to be the latest
>      >                 that will work
>      >
>      >
>      >             I'm not certain... do you happen to have the server
>      >             attribute resources_available.nodect set?  I have seen
>      >             bugs with PBS_NODEFILE contents when this server
>      >             attribute is set.  This may be a manifestation of that
>      >             bug, and I'm not sure if it has been corrected.
>      >
>      >             try unsetting this and submitting a job with
>      >             -l nodes=X:ppn=Y
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


