[torqueusers] PBS_NODEFILE incomplete (entries for last(?) node only)

Gordon Wells gordon.wells at gmail.com
Mon Oct 11 01:49:19 MDT 2010


Hi

The various /etc/hosts, nodes, server_name and config files all seem to be
consistent. The nodes are indeed connected to the internet; could that be
problematic?

As for 5), won't that require $PBS_NODEFILE to be correctly generated?
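
Checking the contents of $PBS_NODEFILE directly doesn't need MPI at all; a
trivial job script is enough (the resource request below is only a
placeholder):

    #PBS -l nodes=2:ppn=2
    echo "nodefile: $PBS_NODEFILE"
    cat $PBS_NODEFILE        # should list each allocated node, repeated ppn times
    wc -l < $PBS_NODEFILE    # should equal nodes * ppn (4 in this example)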

Regards
Gordon

-- max(∫(εὐδαιμονία)dt)

Dr Gordon Wells
Bioinformatics and Computational Biology Unit
Department of Biochemistry
University of Pretoria


On 8 October 2010 01:09, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hi Gordon
>
> Some guesses:
>
> 1) Do you have mom daemons running on the nodes?
> I.e. on the nodes, what is the output of "service pbs status" or
> "service pbs_mom status"?
>
> 2) Do your mom daemons on the nodes point to the server?
> I.e. what is the content of $TORQUE/mom_priv/config?
> Is it consistent with the server name in $TORQUE/server_name ?
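>
> For reference, a minimal mom_priv/config pointing at the server might look
> like this (the host name is only a placeholder):
>
>    # $TORQUE/mom_priv/config on each compute node
>    $pbsserver   headnode.example.org
>    $logevent    255
>
>    # $TORQUE/server_name (same host name as above)
>    headnode.example.org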
>
> 3) What is the content of your /etc/hosts file on the head node
> and on each node?
> Are they the same?
> Are they consistent with your nodes file,
> i.e. head_node:$TORQUE/server_priv/nodes (i.e. same host names
> that have IP addresses listed in /etc/hosts)?
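>
> Something along these lines on every machine (names and addresses are only
> placeholders):
>
>    # /etc/hosts -- identical on the head node and every compute node
>    192.168.0.1    headnode
>    192.168.0.11   node01
>    192.168.0.12   node02
>
>    # $TORQUE/server_priv/nodes on the head node -- same host names as above
>    node01 np=2
>    node02 np=2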
>
> 4) Are you really using the Internet to connect the nodes,
> as the fqdn names on your nodes file (sent in an old email) suggest?
> (I can't find it, maybe you can post it again.)
> Or are you using a private subnet?
>
> 5) Did you try to run hostname via mpirun on all nodes?
> I.e., something like this:
>
> ...
> #PBS -l nodes=8:ppn=2
> ...
> mpirun -np 16 hostname
>
>
> I hope this helps,
> Gus Correa
>
> Gordon Wells wrote:
> > I've tried that; unfortunately I never get a $PBS_NODEFILE that spans
> > more than one node.
> >
> > -- max(∫(εὐδαιμονία)dt)
> >
> > Dr Gordon Wells
> > Bioinformatics and Computational Biology Unit
> > Department of Biochemistry
> > University of Pretoria
> >
> >
> > On 7 October 2010 10:02, Vaibhav Pol <vaibhavp at cdac.in
> > <mailto:vaibhavp at cdac.in>> wrote:
> >
> >      Hi,
> >      You must set the server as well as the queue attribute:
> >
> >             set server resources_available.nodect = (number of nodes *
> >                 cpus per node)
> >             set queue <queue name> resources_available.nodect = (number
> >                 of nodes * cpus per node)
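> >
> >      For example, on a hypothetical 14-node cluster with 2 cpus per node
> >      (the queue name "batch" is only a placeholder):
> >
> >             qmgr -c "set server resources_available.nodect = 28"
> >             qmgr -c "set queue batch resources_available.nodect = 28"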
> >
> >
> >      Thanks and regards,
> >      Vaibhav Pol
> >      National PARAM Supercomputing Facility
> >      Centre for Development of Advanced Computing
> >      Ganeshkhind Road
> >      Pune University Campus
> >      PUNE-Maharastra
> >      Phone +91-20-25704176 ext: 176
> >      Cell Phone :  +919850466409
> >
> >
> >
> >     On Thu, 7 Oct 2010, Gordon Wells wrote:
> >
> >         Hi
> >
> >         I've now tried torque 2.5.2 as well; same problems.
> >         Setting resources_available.nodect has no effect except allowing
> >         me to use "-l nodes=x" with x > 14.
> >
> >         regards
> >
> >         -- max(∫(εὐδαιμονία)dt)
> >
> >         Dr Gordon Wells
> >         Bioinformatics and Computational Biology Unit
> >         Department of Biochemistry
> >         University of Pretoria
> >
> >
> >         On 6 October 2010 20:04, Glen Beane <glen.beane at gmail.com
> >         <mailto:glen.beane at gmail.com>> wrote:
> >
> >             On Wed, Oct 6, 2010 at 1:12 PM, Gordon Wells
> >             <gordon.wells at gmail.com <mailto:gordon.wells at gmail.com>>
> >             wrote:
> >
> >                 Can I confirm that this will definitely fix the problem?
> >                 Unfortunately this cluster also needs to be glite
> >                 compatible; 2.3.6 seems to be the latest that will work.
> >
> >
> >
> >             I'm not certain...  Do you happen to have the server
> >             attribute resources_available.nodect set?  I have seen bugs
> >             with PBS_NODEFILE contents when this server attribute is
> >             set.  This may be a manifestation of that bug, and I'm not
> >             sure whether it has been corrected.
> >
> >             Try unsetting it and submitting a job with -l nodes=X:ppn=Y.
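> >
> >             For example (the resource request is only a placeholder):
> >
> >                 qmgr -c "unset server resources_available.nodect"
> >                 echo 'cat $PBS_NODEFILE' | qsub -l nodes=2:ppn=2
> >
> >             The job's output file should then list every allocated node,
> >             repeated ppn times.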
> >
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

