[torqueusers] specifying nodes for MPI jobs on small cluster

Gus Correa gus at ldeo.columbia.edu
Mon Feb 11 09:02:00 MST 2013


Thank you, Andrew!
That is good news.
Why Torque mixes the concepts of "node" and "processor"
still doesn't make much sense to me, but that's the way it is.
Gus

On 02/11/2013 04:02 AM, Andrew Dawson wrote:
> Hi Gus,
>
> If I submit enough jobs to fill all available CPUs plus a few extra, then
> Torque runs enough jobs to fill all available CPUs and queues the
> remainder until a CPU becomes available. So everything is fine; that is
> what I would expect.
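>
> (This is easy to confirm with qstat, for example:)
>
>     qstat -a    # the "S" column shows R for running jobs, Q for queued ones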
>
> Andrew
>
>
> On 8 February 2013 16:29, Gustavo Correa <gus at ldeo.columbia.edu> wrote:
>
>     Thank you, Andrew.
>
>     What happens if you launch enough jobs to fill "nodes" that were
>     assigned to resources_available.nodes (on the server and queues)
>     but that do not actually exist physically (or as described by the
>     nodes *file* in the server_priv directory)?
>
>     For instance, if you set resources_available.nodes=41 (the number of
>     cores/processors, not the number of actual nodes in your cluster,
>     IIRC), then launch enough jobs to fill 41 "nodes",
>     will all those jobs run, or will the actual node information stored
>     in the nodes *file* take precedence?
>     I.e., will the physical nodes and processors get oversubscribed,
>     or will some jobs sit and wait in Q (queued) state,
>     with Torque running (R state) only enough jobs to
>     fill the physical nodes and cores available?
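>
>     (For reference, by resources_available.nodes I mean server and queue
>     settings along these lines; the 41 is just the number from my example
>     above, and "normal" stands for your queue name:)
>
>         qmgr -c 'set server resources_available.nodes = 41'
>         qmgr -c 'set queue normal resources_available.nodes = 41'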
>
>     I confess I find Torque's context-dependent interpretation of the
>     word "node" more harmful than helpful.
>     It may also interact in unclear ways with
>     the scheduler settings (e.g. Maui).
>     Maybe the context-dependent interpretation is there to keep up with
>     legacy interpretations.
>     I would prefer a more rigid (and hopefully less confusing)
>     notion of what a node and a processor are, even with the current
>     blurring of the dividing line by multicore processors, GPUs, etc.
>
>     I cannot test this now. The cluster is a production machine,
>     and it is currently down due to a blizzard here.
>
>     Thank you,
>     Gus Correa
>
>
>
>
>     On Feb 8, 2013, at 11:04 AM, Andrew Dawson wrote:
>
>      > For others who are interested, the guidance at
>     http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/faq.htm#qsubNotAllow
>     resolves my particular issue, so thanks Michel!
>      >
>      >
>      > On 7 February 2013 21:40, Gus Correa <gus at ldeo.columbia.edu> wrote:
>      > Hi Andrew
>      >
>      > I never had much luck with procs=YZ,
>      > which is likely to be the syntax that matches what you want to do.
>      > Maui (the scheduler I use) seems not to understand that
>      > syntax very well.
>      >
>      > I wouldn't rely completely on the Torque documentation.
>      > It has good guidelines, but may have mistakes in the details.
>      > Trial and error may be the way to check what works for you.
>      > I wonder if the error message you see may come
>      > from different interpretations given to the word "node"
>      > by the Torque server (pbs_server) and the scheduler (which
>      > may be Maui, pbs_sched, or perhaps Moab).
>      >
>      > If you also want to control which nodes
>      > (and sockets and cores) each MPI *process* is sent to,
>      > I suggest that you build OpenMPI with Torque support.
>      > When built with Torque support, OpenMPI
>      > will use the nodes and processors assigned
>      > by Torque to that job,
>      > but you can still decide how the sockets and
>      > cores are distributed among the various MPI processes,
>      > through mpiexec switches such as --bynode, --bysocket,
>      > and --bycore, or with even finer control through their "rankfiles".
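>      >
>      > For example, something along these lines (the install paths and the
>      > executable name are just placeholders for your own setup):
>      >
>      >     # build OpenMPI against the Torque TM interface
>      >     ./configure --prefix=/opt/openmpi --with-tm=/usr/local
>      >     make install
>      >
>      >     # inside the Torque job script: mpiexec picks up the allocated
>      >     # slots from Torque, so no -np or hostfile is needed
>      >     mpiexec --bynode ./my_mpi_program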
>      >
>      > I hope this helps,
>      > Gus Correa
>      >
>      > On 02/07/2013 03:54 PM, Andrew Dawson wrote:
>      > > Hi Gus,
>      > >
>      > > Yes, I can do that. What I would like is for users to be able to
>      > > request the number of CPUs for an MPI job without having to care
>      > > how these CPUs are distributed across physical nodes. If I do
>      > >
>      > > #PBS -l nodes=1:ppn=8
>      > >
>      > > then this will mean the job has to wait until there are 8 CPUs on
>      > > one physical node before starting, correct?
>      > >
>      > > The Torque documentation seems to say I can do:
>      > >
>      > > #PBS -l nodes=8
>      > >
>      > > and this will be interpreted as 8 CPUs rather than 8 physical nodes.
>      > > This is what I want. Unfortunately, I get an error message at
>      > > submission time saying there are not enough resources to fulfill
>      > > this request, even though there are 33 CPUs in the system. If on my
>      > > system I do
>      > >
>      > > #PBS -l nodes=5
>      > >
>      > > then my MPI job gets sent to 5 CPUs, not necessarily on the same
>      > > physical node, which is great and exactly what I want. I would
>      > > therefore expect this to work for larger numbers, but it seems that
>      > > at submission time the request is checked against the number of
>      > > physical nodes rather than virtual processors, meaning I cannot do
>      > > this! It is quite frustrating.
>      > >
>      > > Please ask if there is anything further I can clarify.
>      > >
>      > > Andrew
>      > >
>      > >
>      > > On 7 February 2013 19:28, Gus Correa <gus at ldeo.columbia.edu> wrote:
>      > >
>      > >     Hi Andrew
>      > >
>      > >     Not sure I understood what exactly you want to do,
>      > >     but have you tried this?
>      > >
>      > >     #PBS -l nodes=1:ppn=8
>      > >
>      > >
>      > >     It will request one node with 8 processors.
>      > >
>      > >     I hope this helps,
>      > >     Gus Correa
>      > >
>      > >     On 02/07/2013 11:38 AM, Andrew Dawson wrote:
>      > > > Nodes file looks like this:
>      > > >
>      > > > cirrus np=1
>      > > > cirrus1 np=8
>      > > > cirrus2 np=8
>      > > > cirrus3 np=8
>      > > > cirrus4 np=8
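>      > > >
>      > > > (pbsnodes reports the same information as pbs_server sees it, which
>      > > > can be handy for cross-checking:)
>      > > >
>      > > >     pbsnodes -a    # lists each node with its state and np count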
>      > > >
>      > > > On 7 Feb 2013 16:25, "Ricardo Román Brenes"
>      > > > <roman.ricardo at gmail.com> wrote:
>      > > >
>      > > >     hi!
>      > > >
>      > > >     What does your node config file look like?
>      > > >
>      > > >     On Thu, Feb 7, 2013 at 3:10 AM, Andrew Dawson
>      > > >     <dawson at atm.ox.ac.uk> wrote:
>      > > >
>      > > >         Hi all,
>      > > >
>      > > >         I'm configuring a recent Torque/Maui installation and I'm
>      > > >         having trouble with submitting MPI jobs. I would like MPI
>      > > >         jobs to specify the number of processors they require and
>      > > >         have those come from any available physical machine; the
>      > > >         users shouldn't need to specify processors per node, etc.
>      > > >
>      > > >         The Torque manual says that the nodes option is mapped to
>      > > >         virtual processors, so for example:
>      > > >
>      > > >              #PBS -l nodes=8
>      > > >
>      > > >         should request 8 virtual processors. The problem I'm having
>      > > >         is that our cluster currently has only 5 physical machines
>      > > >         (nodes), and setting nodes to anything greater than 5 gives
>      > > >         the error:
>      > > >
>      > > >              qsub: Job exceeds queue resource limits MSG=cannot
>      > > >              locate feasible nodes (nodes file is empty or all
>      > > >              systems are busy)
>      > > >
>      > > >         I'm confused by this; we have 33 virtual processors
>      > > >         available across the 5 nodes (four 8-core machines and one
>      > > >         single-core machine), so my interpretation of the manual is
>      > > >         that I should be able to request 8 nodes, since these should
>      > > >         be understood as virtual processors. Am I doing something
>      > > >         wrong?
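>      > > >
>      > > >         One check that may be relevant here (a sketch; the grep
>      > > >         just filters the lines of interest) is what the server
>      > > >         itself currently advertises as available:
>      > > >
>      > > >              qmgr -c 'print server' | grep resources_available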
>      > > >
>      > > >         I tried setting
>      > > >
>      > > >         #PBS -l procs=8
>      > > >
>      > > >         but that doesn't seem to do anything; MPI stops because
>      > > >         only 1 worker is available (a single core was allocated
>      > > >         to the job).
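>      > > >
>      > > >         The allocation can be inspected from inside the job via
>      > > >         the node file Torque writes for it, for example:
>      > > >
>      > > >              cat $PBS_NODEFILE         # one line per allocated core
>      > > >              wc -l < $PBS_NODEFILE     # here this shows only 1 line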
>      > > >
>      > > >         Thanks,
>      > > >         Andrew
>      > > >
>      > > >         p.s.
>      > > >
>      > > >         The queue I'm submitting jobs to is defined as:
>      > > >
>      > > >         create queue normal
>      > > >         set queue normal queue_type = Execution
>      > > >         set queue normal resources_min.cput = 12:00:00
>      > > >         set queue normal resources_default.cput = 24:00:00
>      > > >         set queue normal disallowed_types = interactive
>      > > >         set queue normal enabled = True
>      > > >         set queue normal started = True
>      > > >
>      > > >         and we are using Torque version 2.5.12 with Maui 3.3.1
>      > > >         for scheduling.
>      > > >
> --
> Dr Andrew Dawson
> Atmospheric, Oceanic & Planetary Physics
> Clarendon Laboratory
> Parks Road
> Oxford OX1 3PU, UK
> Tel: +44 (0)1865 282438
> Email: dawson at atm.ox.ac.uk
> Web Site: http://www2.physics.ox.ac.uk/contacts/people/dawson
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


