[torqueusers] specifying nodes for MPI jobs on small cluster
Andrew Dawson
dawson at atm.ox.ac.uk
Mon Feb 11 02:02:37 MST 2013
Hi Gus,
If I start enough jobs to fill all available CPUs plus a few extra, then
Torque launches enough jobs to fill all available CPUs and queues the
remainder until a CPU becomes available. So everything is fine; that is
what I would expect.
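
For anyone finding this in the archive, here is a minimal sketch of the setting discussed in this thread. It sums the np= values of a nodes file and prints (it does not run) qmgr commands that set resources_available.nodes on the server and on the queue. The nodes entries and the queue name "normal" are taken from earlier in the thread. The temporary file, the variable names, the assumed default of one processor for entries without np=, and the choice to only echo the commands are assumptions made for illustration. Check the FAQ linked further down before applying anything to a real server.

```shell
#!/bin/sh
# Sketch only: work on a throwaway copy of the nodes file quoted in this
# thread. The real file lives in the server_priv directory.
nodes_file=$(mktemp)
cat > "$nodes_file" <<'EOF'
cirrus np=1
cirrus1 np=8
cirrus2 np=8
cirrus3 np=8
cirrus4 np=8
EOF

# Sum the np=<count> field of every entry.
total=$(awk '
{
    n = 1                                  # assumed default when np= is absent
    for (i = 2; i <= NF; i++)
        if ($i ~ /^np=/) n = substr($i, 4) # text after "np="
    sum += n
}
END { print sum + 0 }' "$nodes_file")
rm -f "$nodes_file"

# Print, rather than execute, the qmgr commands for the server and the queue.
echo "qmgr -c \"set server resources_available.nodes = $total\""
echo "qmgr -c \"set queue normal resources_available.nodes = $total\""
```

With the nodes file above the total is 33, matching the CPU count quoted in this thread.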
Andrew
On 8 February 2013 16:29, Gustavo Correa <gus at ldeo.columbia.edu> wrote:
> Thank you Andrew.
>
> What happens if you launch enough jobs to fill out enough "nodes"
> that were assigned to resources_available.nodes (server and queues),
> but that do not actually exist physically (or as described by the nodes
> *file* in the server_priv directory)?
>
> For instance, if you set resources_available.nodes=41 (the number of
> cores/processors, not actual nodes in your cluster, IIRR), then launch
> enough jobs to fill 41 nodes, will all those jobs run, or will the actual
> nodes information stored in the nodes *file* take precedence?
> I.e. will the physical nodes-and-processors get oversubscribed,
> or will some jobs sit and wait in Q (queued) state,
> and Torque will only run (R state) enough jobs to
> fill the physical nodes-and-cores available?
>
> I confess I find the context-dependent interpretation of the word "node"
> by Torque more harmful than helpful.
> It may also interact in unclear ways with
> the scheduler settings (e.g. Maui).
> Maybe the context-dependent interpretation is there to keep up with legacy
> interpretations.
> I would rather have a more rigid (and hopefully less confusing)
> notion of what a node and a processor are, even with the current blurring
> of the dividing line by multicore processors, GPUs, etc.
>
> I cannot test this now. The cluster is a production machine,
> and now it is down due to a blizzard here.
>
> Thank you,
> Gus Correa
>
>
>
>
> On Feb 8, 2013, at 11:04 AM, Andrew Dawson wrote:
>
> > For others who are interested, the guidance at
> > http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/faq.htm#qsubNotAllow
> > resolves my particular issue, so thanks Michel!
> >
> >
> > On 7 February 2013 21:40, Gus Correa <gus at ldeo.columbia.edu> wrote:
> > Hi Andrew
> >
> > I never had much luck with procs=YZ,
> > which is likely to be the syntax that matches what you want to do.
> > Maui (the scheduler I use) seems not to understand that
> > syntax very well.
> >
> > I wouldn't rely completely on the Torque documentation.
> > It has good guidelines, but may have mistakes in the details.
> > Trial and error may be the way to check what works for you.
> > I wonder if the error message you see may come
> > from different interpretations given to the word "node"
> > by the torque server (pbs_server) and the scheduler (which
> > may be Maui, pbs_sched or perhaps Moab).
> >
> > If you also want to control which nodes
> > (and sockets and cores) each MPI *process* is sent to,
> > I suggest that you build OpenMPI with Torque support.
> > OpenMPI when built with Torque support
> > will use the nodes and processors assigned
> > by Torque to that job,
> > but you can still decide how the sockets and
> > cores are distributed among the various MPI processes,
> > through switches to mpiexec such as --bynode, --bysocket,
> > --bycore, or even finer control through their "rankfiles".
> >
> > I hope this helps,
> > Gus Correa
> >
> > On 02/07/2013 03:54 PM, Andrew Dawson wrote:
> > > Hi Gus,
> > >
> > > Yes I can do that. What I would like to do is be able to have users
> > > request the number of CPUs for an MPI job and not have to care how
> > > these CPUs are distributed across physical nodes. If I do
> > >
> > > #PBS -l nodes=1:ppn=8
> > >
> > > then this will mean the job has to wait until there are 8 CPUs on one
> > > physical node before starting, correct?
> > >
> > > From the torque documentation, it seems to say I can do:
> > >
> > > #PBS -l nodes=8
> > >
> > > and this will be interpreted as 8 CPUs rather than 8 physical nodes.
> > > This is what I want. Unfortunately I get the error message at
> > > submission time saying there are not enough resources to fulfill this
> > > request, even though there are 33 CPUs in the system. If on my system I do
> > >
> > > #PBS -l nodes=5
> > >
> > > then my MPI job gets sent to 5 CPUs, not necessarily on the same
> > > physical node, which is great and exactly what I want. I would
> > > therefore expect this to work for larger numbers but it seems that at
> > > submission time the request is checked against the number of physical
> > > nodes rather than virtual processors, meaning I cannot do this! It is
> > > quite frustrating.
> > >
> > > Please ask if there is further clarification I can make.
> > >
> > > Andrew
> > >
> > >
> > > On 7 February 2013 19:28, Gus Correa <gus at ldeo.columbia.edu> wrote:
> > >
> > > Hi Andrew
> > >
> > > Not sure I understood what exactly you want to do,
> > > but have you tried this?
> > >
> > > #PBS -l nodes=1:ppn=8
> > >
> > >
> > > It will request one node with 8 processors.
> > >
> > > I hope this helps,
> > > Gus Correa
> > >
> > > On 02/07/2013 11:38 AM, Andrew Dawson wrote:
> > > > Nodes file looks like this:
> > > >
> > > > cirrus np=1
> > > > cirrus1 np=8
> > > > cirrus2 np=8
> > > > cirrus3 np=8
> > > > cirrus4 np=8
> > > >
> > > > On 7 Feb 2013 16:25, "Ricardo Román Brenes" <roman.ricardo at gmail.com> wrote:
> > > >
> > > > hi!
> > > >
> > > > What does your node config file look like?
> > > >
> > > > On Thu, Feb 7, 2013 at 3:10 AM, Andrew Dawson <dawson at atm.ox.ac.uk> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I'm configuring a recent torque/maui installation and I'm having
> > > > trouble with submitting MPI jobs. I would like MPI jobs to
> > > > specify the number of processors they require and have those
> > > > come from any available physical machine; the users shouldn't
> > > > need to specify processors per node etc.
> > > >
> > > > The torque manual says that the nodes option is mapped to
> > > > virtual processors, so for example:
> > > >
> > > > #PBS -l nodes=8
> > > >
> > > > should request 8 virtual processors. The problem I'm having is
> > > > that our cluster currently has only 5 physical machines (nodes),
> > > > and setting nodes to anything greater than 5 gives the error:
> > > >
> > > > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy)
> > > >
> > > > I'm confused by this. We have 33 virtual processors available
> > > > across the 5 nodes (four 8-core machines and one single-core machine),
> > > > so my interpretation of the manual is that I should be able to request
> > > > 8 nodes, since these should be understood as virtual processors.
> > > > Am I doing something wrong?
> > > >
> > > > I tried setting
> > > >
> > > > #PBS -l procs=8
> > > >
> > > > but that doesn't seem to do anything; MPI stops due to having
> > > > only 1 worker available (single core allocated to the job).
> > > >
> > > > Thanks,
> > > > Andrew
> > > >
> > > > p.s.
> > > >
> > > > The queue I'm submitting jobs to is defined as:
> > > >
> > > > create queue normal
> > > > set queue normal queue_type = Execution
> > > > set queue normal resources_min.cput = 12:00:00
> > > > set queue normal resources_default.cput = 24:00:00
> > > > set queue normal disallowed_types = interactive
> > > > set queue normal enabled = True
> > > > set queue normal started = True
> > > >
> > > > and we are using torque version 2.5.12 with maui 3.3.1
> > > > for scheduling.
> > > >
> > > >
> > > > _______________________________________________
> > > > torqueusers mailing list
> > > > torqueusers at supercluster.org
> > > > http://www.supercluster.org/mailman/listinfo/torqueusers
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > > --
> > > Dr Andrew Dawson
> > > Atmospheric, Oceanic & Planetary Physics
> > > Clarendon Laboratory
> > > Parks Road
> > > Oxford OX1 3PU, UK
> > > Tel: +44 (0)1865 282438
> > > Email: dawson at atm.ox.ac.uk
> > > Web Site: http://www2.physics.ox.ac.uk/contacts/people/dawson
> > >
> > >
> >
> >
> >
>
--
Dr Andrew Dawson
Atmospheric, Oceanic & Planetary Physics
Clarendon Laboratory
Parks Road
Oxford OX1 3PU, UK
Tel: +44 (0)1865 282438
Email: dawson at atm.ox.ac.uk
Web Site: http://www2.physics.ox.ac.uk/contacts/people/dawson