[torqueusers] specifying nodes for MPI jobs on small cluster

Andrew Dawson dawson at atm.ox.ac.uk
Mon Feb 11 02:02:37 MST 2013


Hi Gus,

If I start enough jobs to fill all available CPUs plus a few extra, then
torque runs enough jobs to fill all available CPUs and queues the
remainder until a CPU becomes free. So everything is fine; that is what
I would expect.
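
For example, on our 33-CPU system with 35 single-CPU jobs submitted,
qstat shows something like this (illustrative output only; the job
names and IDs are made up):

     Job id          Name    User    Time Use S Queue
     --------------- ------- ------- -------- - -----
     131.cirrus      job33   andrew  00:00:05 R normal
     132.cirrus      job34   andrew         0 Q normal
     133.cirrus      job35   andrew         0 Q normal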

Andrew


On 8 February 2013 16:29, Gustavo Correa <gus at ldeo.columbia.edu> wrote:

> Thank you, Andrew.
>
> What happens if you launch enough jobs to fill all the "nodes"
> assigned to resources_available.nodes (on the server and queues),
> when some of those "nodes" do not actually exist physically (i.e.
> are not described by the nodes *file* in the server_priv directory)?
>
> For instance, if you set resources_available.nodes=41 (the number of
> cores/processors, not actual nodes in your cluster, if I remember right),
> then launch enough jobs to fill 41 nodes, will all those jobs run,
> or will the actual node information stored in the nodes *file*
> take precedence?
> I.e. will the physical nodes-and-processors get oversubscribed,
> or will some jobs sit and wait in Q (queued) state,
> and Torque will run (R state) only enough jobs to
> fill the physical nodes-and-cores available?
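>
> (One way to compare the two views, for anyone who can test, is with
> the standard commands:
>
>     pbsnodes -a              # the physical nodes and their np= counts
>     qmgr -c 'print server'   # resources_available.nodes et al.
>
> and then to watch in qstat how many jobs go to R versus Q state.)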
>
> I confess I find the context-dependent interpretation of the word "node"
> by Torque more harmful than helpful.
> It may also interact in unclear ways with
> the scheduler settings (e.g. Maui).
> Maybe the context-dependent interpretation exists to keep up with
> legacy behavior.
> I would prefer a stricter (and hopefully less confusing)
> notion of what a node and a processor are, even with the dividing
> line currently blurred by multicore processors, GPUs, etc.
>
> I cannot test this now. The cluster is a production machine,
> and right now it is down due to a blizzard here.
>
> Thank you,
> Gus Correa
>
>
>
>
> On Feb 8, 2013, at 11:04 AM, Andrew Dawson wrote:
>
> > For others who are interested, the guidance at
> > http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/faq.htm#qsubNotAllow
> > resolves my particular issue, so thanks Michel!
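> >
> > (If I read that FAQ entry correctly, the fix amounts to raising the
> > server's count of available node resources with qmgr, for example:
> >
> >     qmgr -c 'set server resources_available.nodect = 33'
> >
> > with the value matching your total processor count, followed by a
> > restart of pbs_server.)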
> >
> >
> > On 7 February 2013 21:40, Gus Correa <gus at ldeo.columbia.edu> wrote:
> > Hi Andrew
> >
> > I never had much luck with procs=YZ,
> > which is likely the syntax that matches what you want to do.
> > Maui (the scheduler I use) does not seem to understand that
> > syntax very well.
> >
> > I wouldn't rely completely on the Torque documentation.
> > It has good guidelines, but may have mistakes in the details.
> > Trial and error may be the way to check what works for you.
> > I wonder if the error message you see may come
> > from different interpretations given to the word "node"
> > by the torque server (pbs_server) and the scheduler (which
> > maybe Maui, pbs_sched or perhaps Moab).
> >
> > If you also want to control which nodes
> > (and sockets and cores) each MPI *process* is sent to,
> > I suggest that you build OpenMPI with Torque support.
> > When built with Torque support, OpenMPI
> > will use the nodes and processors assigned
> > by Torque to that job,
> > but you can still decide how the sockets and
> > cores are distributed among the various MPI processes,
> > through switches to mpiexec such as --bynode, --bysocket,
> > --bycore, or with even finer control through "rankfiles".
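> >
> > For instance, a minimal sketch (my_app and the rank count are
> > placeholders):
> >
> >     mpiexec -np 8 --bynode ./my_app
> >
> > would place successive MPI ranks round-robin across the nodes
> > Torque assigned, whereas --bycore fills each node's cores before
> > moving on to the next.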
> >
> > I hope this helps,
> > Gus Correa
> >
> > On 02/07/2013 03:54 PM, Andrew Dawson wrote:
> > > Hi Gus,
> > >
> > > Yes I can do that. What I would like is to have users request the
> > > number of CPUs for an MPI job without having to care how these CPUs
> > > are distributed across physical nodes. If I do
> > >
> > > #PBS -l nodes=1:ppn=8
> > >
> > > then this will mean the job has to wait until there are 8 free CPUs
> > > on one physical node before starting, correct?
> > >
> > > From the torque documentation, it seems that I can do:
> > >
> > > #PBS -l nodes=8
> > >
> > > and this will be interpreted as 8 CPUs rather than 8 physical nodes.
> > > This is what I want. Unfortunately I get the error message at
> > > submission time saying there are not enough resources to fulfill this
> > > request, even though there are 33 CPUs in the system. If on my system I do
> > >
> > > #PBS -l nodes=5
> > >
> > > then my MPI job gets sent to 5 CPUs, not necessarily on the same
> > > physical node, which is great and exactly what I want. I would
> > > therefore expect this to work for larger numbers, but it seems that at
> > > submission time the request is checked against the number of physical
> > > nodes rather than virtual processors, meaning I cannot do this! It is
> > > quite frustrating.
> > >
> > > Please ask if there is further clarification I can make.
> > >
> > > Andrew
> > >
> > >
> > > On 7 February 2013 19:28, Gus Correa <gus at ldeo.columbia.edu> wrote:
> > >
> > >     Hi Andrew
> > >
> > >     Not sure I understood what exactly you want to do,
> > >     but have you tried this?
> > >
> > >     #PBS -l nodes=1:ppn=8
> > >
> > >
> > >     It will request one node with 8 processors.
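> > >
> > >     For example, in a complete job script this would look something
> > >     like (a sketch; the program name my_app is a placeholder):
> > >
> > >         #!/bin/bash
> > >         #PBS -l nodes=1:ppn=8
> > >         cd $PBS_O_WORKDIR
> > >         mpiexec -np 8 ./my_app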
> > >
> > >     I hope this helps,
> > >     Gus Correa
> > >
> > >     On 02/07/2013 11:38 AM, Andrew Dawson wrote:
> > >      > Nodes file looks like this:
> > >      >
> > >      > cirrus np=1
> > >      > cirrus1 np=8
> > >      > cirrus2 np=8
> > >      > cirrus3 np=8
> > >      > cirrus4 np=8
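> > >      >
> > >      > (np sums to 1+8+8+8+8 = 33 virtual processors in total)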
> > >      >
> > >      > On 7 Feb 2013 16:25, "Ricardo Román Brenes"
> > >      > <roman.ricardo at gmail.com> wrote:
> > >      >
> > >      >     hi!
> > >      >
> > >      >     What does your node config file look like?
> > >      >
> > >      >     On Thu, Feb 7, 2013 at 3:10 AM, Andrew Dawson
> > >      >     <dawson at atm.ox.ac.uk> wrote:
> > >      >
> > >      >         Hi all,
> > >      >
> > >      >         I'm configuring a recent torque/maui installation and
> > >      >         I'm having trouble submitting MPI jobs. I would like MPI
> > >      >         jobs to specify the number of processors they require and
> > >      >         have those come from any available physical machine; the
> > >      >         users shouldn't need to specify processors per node etc.
> > >      >
> > >      >         The torque manual says that the nodes option is mapped
> > >      >         to virtual processors, so for example:
> > >      >
> > >      >              #PBS -l nodes=8
> > >      >
> > >      >         should request 8 virtual processors. The problem I'm
> > >      >         having is that our cluster currently has only 5 physical
> > >      >         machines (nodes), and setting nodes to anything greater
> > >      >         than 5 gives the error:
> > >      >
> > >      >              qsub: Job exceeds queue resource limits MSG=cannot
> > >      >              locate feasible nodes (nodes file is empty or all
> > >      >              systems are busy)
> > >      >
> > >      >         I'm confused by this: we have 33 virtual processors
> > >      >         available across the 5 nodes (four 8-core machines and one
> > >      >         single-core machine), so my interpretation of the manual is
> > >      >         that I should be able to request 8 nodes, since these
> > >      >         should be understood as virtual processors. Am I doing
> > >      >         something wrong?
> > >      >
> > >      >         I tried setting
> > >      >
> > >      >         #PBS -l procs=8
> > >      >
> > >      >         but that doesn't seem to do anything; MPI stops because it
> > >      >         has only 1 worker available (a single core is allocated to
> > >      >         the job).
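> > >      >
> > >      >         A quick way to see what was actually allocated is to
> > >      >         inspect the node file Torque hands to the job (a minimal
> > >      >         sketch inside the job script, using the standard
> > >      >         PBS_NODEFILE variable):
> > >      >
> > >      >              # one line per allocated virtual processor
> > >      >              cat $PBS_NODEFILE
> > >      >              wc -l < $PBS_NODEFILE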
> > >      >
> > >      >         Thanks,
> > >      >         Andrew
> > >      >
> > >      >         p.s.
> > >      >
> > >      >         The queue I'm submitting jobs to is defined as:
> > >      >
> > >      >         create queue normal
> > >      >         set queue normal queue_type = Execution
> > >      >         set queue normal resources_min.cput = 12:00:00
> > >      >         set queue normal resources_default.cput = 24:00:00
> > >      >         set queue normal disallowed_types = interactive
> > >      >         set queue normal enabled = True
> > >      >         set queue normal started = True
> > >      >
> > >      >         and we are using torque version 2.5.12 and maui 3.3.1
> > >      >         for scheduling.
> > >      >
> > >      >
> >
>
>



-- 
Dr Andrew Dawson
Atmospheric, Oceanic & Planetary Physics
Clarendon Laboratory
Parks Road
Oxford OX1 3PU, UK
Tel: +44 (0)1865 282438
Email: dawson at atm.ox.ac.uk
Web Site: http://www2.physics.ox.ac.uk/contacts/people/dawson