[torqueusers] PBS_NODEFILE incomplete (entries for last(?) node only)

Gordon Wells gordon.wells at gmail.com
Mon Oct 11 23:45:10 MDT 2010


Hi Gus

Thanks for the info, but this doesn't seem to be related to why
$PBS_NODEFILE only ever contains entries for one node. I can ssh
passwordlessly, both as myself and as root, between the head node and the
compute nodes using short hostnames, so I don't think there is a problem
there.

Kind regards
Gordon

-- max(∫(εὐδαιμονία)dt)

Dr Gordon Wells
Bioinformatics and Computational Biology Unit
Department of Biochemistry
University of Pretoria


On 11 October 2010 19:10, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Gordon Wells wrote:
> > Hi
> >
> > The various /etc/hosts, nodes, server_name and config files seem to
> > be consistent. The nodes are indeed connected to the internet; could
> > that be problematic?
>
> Hi Gordon
>
> Yes, if the nodes are behind firewalls, or have some iptables setting
> restricting the connections.
> A firewall may prevent Torque and MPI from working.
> Moreover, if you use the Internet addresses,
> other network traffic may hurt performance (MPI, I/O, etc.).
>
> Here I (and most people) use a private subnet for this, say 192.168.1.0
> or 10.1.1.0, either one with netmask 255.255.255.0.
> Sometimes two private subnets, one for cluster control and I/O,
> another for MPI.
> Typical server motherboards come with two onboard Ethernet ports,
> but you can also plug Gigabit Ethernet NICs into available motherboard
> slots.
> You could buy cat5e cables and a new switch for this, or, if your switch
> has VLAN capability and enough idle ports,
> you can create a virtual subnet on it.
>
> On each node you have to configure these new interfaces properly,
> either through DHCP or statically (quite easy, put the IP
> addresses and the netmask on
> /etc/sysconfig/network-scripts/ifcfg-eth1, assuming eth1
> is the private subnet interface ... oh well, this is for
> RHEL/CentOS/Fedora; it may be somewhat
> different in Debian/Ubuntu or SLES).
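>
> For example, a minimal static ifcfg-eth1 might look something like the
> lines below (the address is just a placeholder on the 192.168.1.0 subnet):
>
> DEVICE=eth1
> BOOTPROTO=static
> IPADDR=192.168.1.1
> NETMASK=255.255.255.0
> ONBOOT=yes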
>
> Then insert names for these interfaces and associated IPs on the
> /etc/hosts files (same on all nodes).
> For instance:
>
> 192.168.1.1 node01
> ...
>
> The same names should also be used in the ${TORQUE}/server_priv/nodes file.
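>
> For example, assuming two cores per node (the np values are only
> illustrative):
>
> node01 np=2
> node02 np=2
> ...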
>
> In any case, either using the Internet or a private subnet,
> you need to make sure the users can
> ssh passwordless across all pairs of nodes.
> Can you do this on all node pairs on your cluster?
>
> This can be done, for instance, by creating an ssh-rsa key pair,
> and putting a bunch of copies of the public key on
> /etc/ssh/ssh_known_hosts2 on all nodes,
> something like this:
>
> 192.168.1.1,node01 ssh-rsa [the same ssh-rsa public key copy goes here]
> 192.168.1.2,node02 ssh-rsa [the same ssh-rsa public key copy goes here]
> ...
>
> However, you *don't want to do this with public IP addresses*,
> only with private ones.
> (Yet another issue with using the Internet for Torque and MPI.)
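>
> A quick way to check this from any node is a small loop like the one
> below (the node names are just placeholders); each ssh should print the
> remote hostname without prompting for a password:
>
> for n in node01 node02 node03; do ssh $n hostname; done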
>
> I hope this helps,
> Gus Correa
>
>
>
> >
> > As for 5), won't that require $PBS_NODEFILE to be correctly generated?
> >
> > Regards
> > Gordon
> >
> > -- max(∫(εὐδαιμονία)dt)
> >
> > Dr Gordon Wells
> > Bioinformatics and Computational Biology Unit
> > Department of Biochemistry
> > University of Pretoria
> >
> >
> > On 8 October 2010 01:09, Gus Correa <gus at ldeo.columbia.edu> wrote:
> >
> >     Hi Gordon
> >
> >     Some guesses:
> >
> >     1) Do you have mom daemons running on the nodes?
> >     I.e. on the nodes, what is the output of "service pbs status" or
> >     "service pbs_mom status"?
> >
> >     2) Do your mom daemons on the nodes point to the server?
> >     I.e. what is the content of $TORQUE/mom_priv/config?
> >     Is it consistent with the server name in $TORQUE/server_name ?
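> >
> >     For reference, a minimal mom_priv/config typically holds little more
> >     than the $pbsserver line, for instance (the head node name below is
> >     just a placeholder):
> >
> >     $pbsserver headnode
> >     $logevent 255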
> >
> >     3) What is the content of your /etc/hosts file on the head node
> >     and on each node?
> >     Are they the same?
> >     Are they consistent with your nodes file,
> >     i.e. head_node:$TORQUE/server_priv/nodes (i.e. same host names
> >     that have IP addresses listed in /etc/hosts)?
> >
> >     4) Are you really using the Internet to connect the nodes,
> >     as the fqdn names on your nodes file (sent in an old email) suggest?
> >     (I can't find it, maybe you can post it again.)
> >     Or are you using a private subnet?
> >
> >     5) Did you try to run hostname via mpirun on all nodes?
> >     I.e., something like this:
> >
> >     ...
> >     #PBS -l nodes=8:ppn=2
> >     ...
> >     mpirun -np 16 hostname
> >
> >
> >     I hope this helps,
> >     Gus Correa
> >
> >     Gordon Wells wrote:
> >      > I've tried that, unfortunately I never get a $PBS_NODEFILE that
> >      > spans more than one node.
> >      >
> >      > -- max(∫(εὐδαιμονία)dt)
> >      >
> >      > Dr Gordon Wells
> >      > Bioinformatics and Computational Biology Unit
> >      > Department of Biochemistry
> >      > University of Pretoria
> >      >
> >      >
> >      > On 7 October 2010 10:02, Vaibhav Pol <vaibhavp at cdac.in> wrote:
> >      >
> >      >      Hi,
> >      >      you must set the server as well as the queue attribute:
> >      >
> >      >          set server resources_available.nodect = (number of nodes * cpus per node)
> >      >          set queue <queue name> resources_available.nodect = (number of nodes * cpus per node)
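> >      >
> >      >      Applied with qmgr, this would look something like the lines
> >      >      below (the queue name "batch" and the value 28, i.e. 14 nodes
> >      >      with 2 cpus each, are only placeholders):
> >      >
> >      >          qmgr -c "set server resources_available.nodect = 28"
> >      >          qmgr -c "set queue batch resources_available.nodect = 28"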
> >      >
> >      >
> >      >      Thanks and regards,
> >      >      Vaibhav Pol
> >      >      National PARAM Supercomputing Facility
> >      >      Centre for Development of Advanced Computing
> >      >      Ganeshkhind Road
> >      >      Pune University Campus
> >      >      PUNE-Maharastra
> >      >      Phone +91-20-25704176 ext: 176
> >      >      Cell Phone :  +919850466409
> >      >
> >      >
> >      >
> >      >     On Thu, 7 Oct 2010, Gordon Wells wrote:
> >      >
> >      >         Hi
> >      >
> >      >         I've now tried torque 2.5.2 as well, same problems.
> >      >         Setting resources_available.nodect has no effect except
> >      >         allowing me to use "-l nodes=x" with x > 14.
> >      >
> >      >         regards
> >      >
> >      >         -- max(∫(εὐδαιμονία)dt)
> >      >
> >      >         Dr Gordon Wells
> >      >         Bioinformatics and Computational Biology Unit
> >      >         Department of Biochemistry
> >      >         University of Pretoria
> >      >
> >      >
> >      >         On 6 October 2010 20:04, Glen Beane <glen.beane at gmail.com> wrote:
> >      >
> >      >             On Wed, Oct 6, 2010 at 1:12 PM, Gordon Wells
> >      >             <gordon.wells at gmail.com> wrote:
> >      >
> >      >                 Can I confirm that this will definitely fix the
> >      >                 problem? Unfortunately this cluster also needs to be
> >      >                 glite compatible; 2.3.6 seems to be the latest that
> >      >                 will work.
> >      >
> >      >
> >      >
> >      >             I'm not certain... do you happen to have the server
> >      >             attribute resources_available.nodect set?  I have seen
> >      >             bugs with PBS_NODEFILE contents when this server
> >      >             attribute is set.  This may be a manifestation of that
> >      >             bug, and I'm not sure if it has been corrected.
> >      >
> >      >             Try unsetting this and submitting a job with
> >      >             -l nodes=X:ppn=Y.
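> >      >
> >      >             For instance, something along these lines (the node and
> >      >             ppn counts and the script name are only placeholders):
> >      >
> >      >                 qmgr -c "unset server resources_available.nodect"
> >      >                 qsub -l nodes=2:ppn=2 job.sh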