[torqueusers] PBS_NODEFILE incomplete (entries for last(?) node only)
Gordon Wells
gordon.wells at gmail.com
Mon Oct 11 01:49:19 MDT 2010
Hi
The various /etc/hosts, nodes, server_name and config files all seem to be
consistent. The nodes are indeed connected to the internet; could that be
problematic?
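
One way to double-check that consistency, as a rough sketch only: the helper below (check_nodes_in_hosts is a made-up name, not a TORQUE tool) lists every host name from a nodes file that has no matching name in a hosts file. It assumes the usual layouts, i.e. the host name is the first field of each nodes-file line, and names are fields 2 onward of each hosts-file line. It does not consult DNS.

```shell
#!/bin/sh
# Hypothetical helper: print "MISSING <host>" for each host in the nodes file
# that does not appear as a name in the hosts file.
# Usage (paths are assumptions): check_nodes_in_hosts "$TORQUE/server_priv/nodes" /etc/hosts
check_nodes_in_hosts() {
    nodes_file=$1
    hosts_file=$2
    # Take the first field of every non-empty, non-comment line of the nodes file.
    awk 'NF && $1 !~ /^#/ {print $1}' "$nodes_file" | while read -r host; do
        # In a hosts file, field 1 is the address and fields 2..NF are its names.
        if ! awk -v h="$host" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit (found ? 0 : 1) }' "$hosts_file"; then
            echo "MISSING $host"
        fi
    done
}
```

Running it on the head node and on each compute node, and getting no output everywhere, would at least rule out a name that is listed in the nodes file but unknown to /etc/hosts.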
As for 5), won't that require $PBS_NODEFILE to be correctly generated?
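
To see what the node file actually contains without involving mpirun, something like the following could go in the job script. It is only a sketch: summarize_nodefile is a made-up helper name, and it assumes the usual node file layout of one host name per line, repeated once per allocated processor.

```shell
#!/bin/sh
# Hypothetical helper: print "<host> <number of entries>" for each host in a node file.
summarize_nodefile() {
    # Count occurrences of each host name, ignoring empty lines, then sort by name.
    awk 'NF {count[$1]++} END {for (h in count) print h, count[h]}' "$1" | sort
}

# Inside a job TORQUE sets PBS_NODEFILE; outside a job this branch is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Total entries: $(grep -c . "$PBS_NODEFILE")"
    summarize_nodefile "$PBS_NODEFILE"
fi
```

For a request like "-l nodes=8:ppn=2" one would expect 16 entries spread over 8 hosts; a file listing only a single host would reproduce the problem described in this thread.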
Regards
Gordon
-- max(∫(εὐδαιμονία)dt)
Dr Gordon Wells
Bioinformatics and Computational Biology Unit
Department of Biochemistry
University of Pretoria
On 8 October 2010 01:09, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Hi Gordon
>
> Some guesses:
>
> 1) Do you have mom daemons running on the nodes?
> I.e. on the nodes, what is the output of "service pbs status" or
> "service pbs_mom status"?
>
> 2) Do your mom daemons on the nodes point to the server?
> I.e. what is the content of $TORQUE/mom_priv/config?
> Is it consistent with the server name in $TORQUE/server_name ?
>
> 3) What is the content of your /etc/hosts file on the head node
> and on each node?
> Are they the same?
> Are they consistent with your nodes file,
> i.e. head_node:$TORQUE/server_priv/nodes (i.e. same host names
> that have IP addresses listed in /etc/hosts)?
>
> 4) Are you really using the Internet to connect the nodes,
> as the fqdn names on your nodes file (sent in an old email) suggest?
> (I can't find it, maybe you can post it again.)
> Or are you using a private subnet?
>
> 5) Did you try to run hostname via mpirun on all nodes?
> I.e., something like this:
>
> ...
> #PBS -l nodes=8:ppn=2
> ...
> mpirun -np 16 hostname
>
>
> I hope this helps,
> Gus Correa
>
> Gordon Wells wrote:
> > I've tried that, unfortunately I never get a $PBS_NODEFILE that spans
> > more than one node.
> >
> > -- max(∫(εὐδαιμονία)dt)
> >
> > Dr Gordon Wells
> > Bioinformatics and Computational Biology Unit
> > Department of Biochemistry
> > University of Pretoria
> >
> >
> > On 7 October 2010 10:02, Vaibhav Pol <vaibhavp at cdac.in> wrote:
> >
> > Hi,
> > You must set the attribute on the server as well as on the queue:
> >
> > set server resources_available.nodect = (number of nodes * cpus per node)
> > set queue <queue name> resources_available.nodect = (number of nodes * cpus per node)
> >
> >
> > Thanks and regards,
> > Vaibhav Pol
> > National PARAM Supercomputing Facility
> > Centre for Development of Advanced Computing
> > Ganeshkhind Road
> > Pune University Campus
> > PUNE-Maharastra
> > Phone +91-20-25704176 ext: 176
> > Cell Phone : +919850466409
> >
> >
> >
> > On Thu, 7 Oct 2010, Gordon Wells wrote:
> >
> > Hi
> >
> > I've now tried torque 2.5.2 as well, same problems.
> > Setting resources_available.nodect has no effect except allowing me to use
> > "-l nodes=x" with x > 14.
> >
> > regards
> >
> > -- max(∫(εὐδαιμονία)dt)
> >
> > Dr Gordon Wells
> > Bioinformatics and Computational Biology Unit
> > Department of Biochemistry
> > University of Pretoria
> >
> >
> > On 6 October 2010 20:04, Glen Beane <glen.beane at gmail.com> wrote:
> >
> > On Wed, Oct 6, 2010 at 1:12 PM, Gordon Wells <gordon.wells at gmail.com> wrote:
> >
> > Can I confirm that this will definitely fix the problem? Unfortunately this
> > cluster also needs to be glite compatible, 2.3.6 seems to be the latest
> > that will work
> >
> >
> >
> > I'm not certain... do you happen to have the server attribute
> > resources_available.nodect set? I have seen bugs with PBS_NODEFILE
> > contents when this server attribute is set. This may be a
> > manifestation of this bug, and I'm not sure if it has been
> > corrected.
> >
> > Try unsetting this and submitting a job with -l nodes=X:ppn=Y
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > <mailto:torqueusers at supercluster.org>
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> > --
> > This message has been scanned for viruses and
> > dangerous content by MailScanner, and is
> > believed to be clean.