[torqueusers] PBS_NODEFILE incomplete (entries for last(?) node only)
Gordon Wells
gordon.wells at gmail.com
Mon Oct 11 23:45:10 MDT 2010
Hi Gus
Thanks for the info, but this doesn't seem to be related to why
$PBS_NODEFILE only ever contains the entries for one node. I can ssh as
myself and root passwordless between the headnode and compute nodes, using
short hostnames, so I don't think there is a problem there.
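
For reference, headnode-to-compute ssh is only part of what the quoted message below asks about (all pairs of nodes). A minimal sketch of checking every ordered pair could look like this. The function name check_ssh_pairs and the SSH_CMD variable are made up for illustration, and node01/node02 are the example names from the quoted /etc/hosts snippet, not necessarily real hostnames on this cluster.

```shell
#!/bin/sh
# Check passwordless ssh between every ordered pair of hosts.
# check_ssh_pairs and SSH_CMD are hypothetical names used for illustration.
SSH_CMD=${SSH_CMD:-ssh}

check_ssh_pairs() {
    failures=0
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] && continue
            # BatchMode=yes makes ssh fail instead of prompting for a password;
            # -n keeps the outer ssh from reading this script's stdin.
            # The outer ssh logs in to $src and runs a second ssh to $dst there.
            if "$SSH_CMD" -n -o BatchMode=yes "$src" \
                   ssh -o BatchMode=yes "$dst" true; then
                echo "ok   $src -> $dst"
            else
                echo "FAIL $src -> $dst"
                failures=$((failures + 1))
            fi
        done
    done
    # Exit status is zero only if every pair succeeded.
    [ "$failures" -eq 0 ]
}
```

It would be called as "check_ssh_pairs node01 node02 ..." from the headnode, once per user of interest. Any FAIL line points at a pair that would still prompt for a password or refuse the connection.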
Kind regards
Gordon
-- max(∫(εὐδαιμονία)dt)
Dr Gordon Wells
Bioinformatics and Computational Biology Unit
Department of Biochemistry
University of Pretoria
On 11 October 2010 19:10, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Gordon Wells wrote:
> > Hi
> >
> > The various /etc/hosts, nodes, server_name and config files all seem to
> > be consistent. The nodes are indeed connected to the internet, could
> > that be problematic?
>
> Hi Gordon
>
> Yes, if the nodes are behind firewalls, or have some IP table setting
> restricting the connections.
> A firewall may prevent torque and MPI from working.
> Moreover, using the Internet addresses,
> the network traffic may hurt performance (MPI, I/O, etc).
>
> Here I (and most people) use a private subnet for this, say 192.168.1.0
> or 10.1.1.0, either one with netmask 255.255.255.0.
> Sometimes two private subnets, one for cluster control and I/O,
> another for MPI.
> Typical server motherboards come with two onboard Ethernet ports,
> but you can also plug in Gigabit Ethernet NICs on available motherboard
> slots.
> You could buy cat5e cables and a new switch for this, or if your switch
> has VLAN capability and enough idle ports,
> you can create a virtual subnet on it.
>
> On each node you have to configure these new interfaces properly,
> either through DHCP or statically. The static setup is quite easy:
> put the IP address and the netmask in
> /etc/sysconfig/network-scripts/ifcfg-eth1, assuming eth1
> is the private subnet interface. This is for
> RHEL/CentOS/Fedora; it may be somewhat
> different in Debian/Ubuntu or SLES.
>
> Then insert names for these interfaces and associated IPs on the
> /etc/hosts files (same on all nodes).
> For instance:
>
> 192.168.1.1 node01
> ...
>
> The same names should also be used in the ${TORQUE}/server_priv/nodes file.
>
> In any case, either using the Internet or a private subnet,
> you need to make sure the users can
> ssh passwordless across all pairs of nodes.
> Can you do this on all node pairs on your cluster?
>
> This can be done, for instance, by creating a ssh-rsa key pair,
> and putting a bunch of copies of the public key on
> /etc/ssh/ssh_known_hosts2 on all nodes,
> something like this:
>
> 192.168.1.1,node01 ssh-rsa [the same ssh-rsa public key copy goes here]
> 192.168.1.2,node02 ssh-rsa [the same ssh-rsa public key copy goes here]
> ...
>
> However, you *don't want to do this with public IP addresses*,
> only with private ones.
> (Yet another issue with using the Internet for Torque and MPI.)
>
> I hope this helps,
> Gus Correa
>
>
>
> >
> > As for 5), won't that require $PBS_NODEFILE to be correctly generated?
> >
> > Regards
> > Gordon
> >
> >
> > On 8 October 2010 01:09, Gus Correa <gus at ldeo.columbia.edu> wrote:
> >
> > Hi Gordon
> >
> > Some guesses:
> >
> > 1) Do you have mom daemons running on the nodes?
> > I.e. on the nodes, what is the output of "service pbs status" or
> > "service pbs_mom status"?
> >
> > 2) Do your mom daemons on the nodes point to the server?
> > I.e. what is the content of $TORQUE/mom_priv/config?
> > Is it consistent with the server name in $TORQUE/server_name ?
> >
> > 3) What is the content of your /etc/hosts file on the head node
> > and on each node?
> > Are they the same?
> > Are they consistent with your nodes file,
> > i.e. head_node:$TORQUE/server_priv/nodes (i.e. same host names
> > that have IP addresses listed in /etc/hosts)?
> >
> > 4) Are you really using the Internet to connect the nodes,
> > as the fqdn names on your nodes file (sent in an old email) suggest?
> > (I can't find it, maybe you can post it again.)
> > Or are you using a private subnet?
> >
> > 5) Did you try to run hostname via mpirun on all nodes?
> > I.e., something like this:
> >
> > ...
> > #PBS -l nodes=8:ppn=2
> > ...
> > mpirun -np 16 hostname
> >
> >
> > I hope this helps,
> > Gus Correa
> >
> > Gordon Wells wrote:
> > > I've tried that, unfortunately I never get a $PBS_NODEFILE that spans
> > > more than one node.
> > >
> > >
> > > On 7 October 2010 10:02, Vaibhav Pol <vaibhavp at cdac.in> wrote:
> > >
> > > Hi ,
> > > you must set server as well as queue attribute.
> > >
> > > set server resources_available.nodect = (number of nodes * cpus per node)
> > > set queue <queue name> resources_available.nodect = (number of nodes * cpus per node)
> > >
> > >
> > > Thanks and regards,
> > > Vaibhav Pol
> > > National PARAM Supercomputing Facility
> > > Centre for Development of Advanced Computing
> > > Ganeshkhind Road
> > > Pune University Campus
> > > PUNE-Maharastra
> > > Phone +91-20-25704176 ext: 176
> > > Cell Phone : +919850466409
> > >
> > >
> > >
> > > On Thu, 7 Oct 2010, Gordon Wells wrote:
> > >
> > > Hi
> > >
> > > I've now tried torque 2.5.2 as well, same problems.
> > > Setting resources_available.nodect has no effect except allowing
> > > me to use "-l nodes=x" with x > 14
> > >
> > > regards
> > >
> > >
> > > On 6 October 2010 20:04, Glen Beane <glen.beane at gmail.com> wrote:
> > >
> > > On Wed, Oct 6, 2010 at 1:12 PM, Gordon Wells
> > > <gordon.wells at gmail.com> wrote:
> > >
> > > Can I confirm that this will definitely fix the problem?
> > > Unfortunately this cluster also needs to be glite compatible, 2.3.6
> > > seems to be the latest that will work
> > >
> > >
> > >
> > > I'm not certain... do you happen to have the server attribute
> > > resources_available.nodect set? I have seen bugs with PBS_NODEFILE
> > > contents when this server attribute is set. This may be a
> > > manifestation of this bug, and I'm not sure if it has been
> > > corrected.
> > >
> > > Try unsetting this and submitting a job with -l nodes=X:ppn=Y
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> > >
> > >
> > > --
> > > This message has been scanned for viruses and
> > > dangerous content by MailScanner, and is
> > > believed to be clean.
>
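
A minimal sketch for seeing what a job was actually handed: placed in a job script (for example the nodes=8:ppn=2 test quoted above), it prints each distinct host in $PBS_NODEFILE together with the number of entries it has. summarize_nodefile is a hypothetical helper name, not a Torque command.

```shell
#!/bin/sh
# Summarise a Torque node file: one line per distinct host with its entry count.
# summarize_nodefile is a hypothetical helper name, not part of Torque.
summarize_nodefile() {
    # sort groups identical host names, uniq -c counts each group,
    # and awk reorders the columns to "host count".
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job, Torque sets PBS_NODEFILE; outside a job it is normally unset,
# in which case this guard simply does nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

With nodes=8:ppn=2 one would expect eight hosts with two entries each; the symptom described in this thread would show up as a single host.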