[torqueusers] specifying nodes for MPI jobs on small cluster

Gustavo Correa gus at ldeo.columbia.edu
Fri Feb 8 09:29:17 MST 2013


Thank you, Andrew.

What happens if you launch enough jobs to fill all of the "nodes"
assigned via resources_available.nodes (on the server and queues),
when those "nodes" do not all actually exist physically (i.e. beyond what is
described by the nodes *file* in the server_priv directory)?

For instance, if you set resources_available.nodes=41 (the number of cores/processors,
not of actual nodes in your cluster, IIRC), and then launch enough jobs to fill 41 nodes,
will all those jobs run, or will the actual node information stored in the nodes 
*file* take precedence?
I.e., will the physical nodes and processors get oversubscribed,
or will some jobs sit and wait in Q (queued) state,
with Torque running (R state) only enough jobs to
fill the physical nodes and cores available?
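
For concreteness, here is a minimal sketch of the two places such a
setup would live (host names and counts are invented for illustration):

   # $TORQUE_HOME/server_priv/nodes -- the physical inventory
   node01 np=8
   node02 np=8

   # qmgr settings declaring more "nodes" than physically exist
   qmgr -c 'set server resources_available.nodes = 41'
   qmgr -c 'set queue normal resources_available.nodes = 41'

The question is which of the two the server/scheduler pair honors once
the physical inventory is exhausted.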

I confess I find Torque's context-dependent interpretation of the word "node"
more harmful than helpful.
It may also interact in unclear ways with
the scheduler settings (e.g. Maui).
Maybe the context-dependent interpretation is kept for compatibility with legacy behavior.
I would prefer a more rigid (and hopefully less confusing)
notion of what a node and a processor are, even with the current blurring of the
dividing line by multicore processors, GPUs, etc.
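
One concrete instance of that scheduler interaction (quoting from memory,
so please double-check): Maui has a JOBNODEMATCHPOLICY parameter in
maui.cfg, and something like

   JOBNODEMATCHPOLICY EXACTNODE

makes "-l nodes=N" mean N distinct physical nodes, whereas the default
policy treats the request more like a processor count.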

I cannot test this now.  The cluster is a production machine,
and it is currently down due to a blizzard here.

Thank you,
Gus Correa




On Feb 8, 2013, at 11:04 AM, Andrew Dawson wrote:

> For others who are interested, the guidance at http://docs.adaptivecomputing.com/torque/Content/topics/11-troubleshooting/faq.htm#qsubNotAllow resolves my particular issue, so thanks Michel!
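> 
> For the record, the fix described there amounts to telling the server it has
> more node resources available than the nodes file implies, roughly along
> these lines (2048 is just the example value the docs use):
> 
>     qmgr -c 'set server resources_available.nodect = 2048'
>     qmgr -c 'set queue normal resources_available.nodect = 2048'
> 
> with a pbs_server restart afterwards for the change to take effect.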
> 
> 
> On 7 February 2013 21:40, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Hi Andrew
> 
> I never had much luck with procs=YZ,
> which is probably the syntax that matches what you want to do.
> Maui (the scheduler I use) does not seem to understand that
> syntax very well.
> 
> I wouldn't rely completely on the Torque documentation.
> It has good guidelines, but may have mistakes in the details.
> Trial and error may be the way to check what works for you.
> I wonder if the error message you see comes
> from different interpretations given to the word "node"
> by the Torque server (pbs_server) and the scheduler (which
> may be Maui, pbs_sched, or perhaps Moab).
> 
> If you also want to control which nodes
> (and sockets and cores) each MPI *process* is sent to,
> I suggest building OpenMPI with Torque support.
> When built with Torque support, OpenMPI
> will use the nodes and processors assigned
> by Torque to that job,
> but you can still decide how the sockets and
> cores are distributed among the various MPI processes,
> through switches to mpiexec such as --bynode, --bysocket,
> and --bycore, or with even finer control through "rankfiles".
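> 
> As a rough sketch (the flags are from the OpenMPI 1.x series; check
> "mpiexec --help" on your build, and the host and program names below
> are just placeholders):
> 
>     # round-robin one process per node across the Torque-assigned nodes
>     mpiexec --bynode -np 8 ./my_mpi_app
> 
>     # or pin ranks explicitly with a rankfile (slot = socket:core)
>     cat > my_rankfile <<EOF
>     rank 0=cirrus1 slot=0:0
>     rank 1=cirrus1 slot=0:1
>     rank 2=cirrus2 slot=0:0
>     EOF
>     mpiexec -rf my_rankfile -np 3 ./my_mpi_app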
> 
> I hope this helps,
> Gus Correa
> 
> On 02/07/2013 03:54 PM, Andrew Dawson wrote:
> > Hi Gus,
> >
> > Yes I can do that. What I would like is for users to be able to
> > request the number of CPUs for an MPI job without having to care how these
> > CPUs are distributed across physical nodes. If I do
> >
> > #PBS -l nodes=1:ppn=8
> >
> > then this will mean the job has to wait until there are 8 free CPUs on one
> > physical node before starting, correct?
> >
> > From the Torque documentation, it seems I can do:
> >
> > #PBS -l nodes=8
> >
> > and this will be interpreted as 8 CPUs rather than 8 physical nodes.
> > This is what I want. Unfortunately, I get an error message at submission
> > time saying there are not enough resources to fulfill this request, even
> > though there are 33 CPUs in the system. If on my system I do
> >
> > #PBS -l nodes=5
> >
> > then my MPI job gets sent to 5 CPUs, not necessarily on the same
> > physical node, which is great and exactly what I want. I would therefore
> > expect this to work for larger numbers, but it seems that at submission
> > time the request is checked against the number of physical nodes rather
> > than virtual processors, meaning I cannot do this! It is quite frustrating.
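> >
> > In case it helps, the job script I'm testing with is essentially just
> > (the program name is a placeholder):
> >
> >     #!/bin/bash
> >     #PBS -l nodes=8
> >     #PBS -q normal
> >     cd $PBS_O_WORKDIR
> >     mpiexec -np 8 ./my_mpi_app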
> >
> > Please ask if there is further clarification I can make.
> >
> > Andrew
> >
> >
> > On 7 February 2013 19:28, Gus Correa <gus at ldeo.columbia.edu> wrote:
> >
> >     Hi Andrew
> >
> >     Not sure I understood exactly what you want to do,
> >     but have you tried this?
> >
> >     #PBS -l nodes=1:ppn=8
> >
> >
> >     It will request one node with 8 processors.
> >
> >     I hope this helps,
> >     Gus Correa
> >
> >     On 02/07/2013 11:38 AM, Andrew Dawson wrote:
> >      > Nodes file looks like this:
> >      >
> >      > cirrus np=1
> >      > cirrus1 np=8
> >      > cirrus2 np=8
> >      > cirrus3 np=8
> >      > cirrus4 np=8
> >      >
> >      > On 7 Feb 2013 16:25, "Ricardo Román Brenes" <roman.ricardo at gmail.com> wrote:
> >      >
> >      >     hi!
> >      >
> >      >     What does your node config file look like?
> >      >
> >      >     On Thu, Feb 7, 2013 at 3:10 AM, Andrew Dawson <dawson at atm.ox.ac.uk> wrote:
> >      >
> >      >         Hi all,
> >      >
> >      >         I'm configuring a recent torque/maui installation and I'm having
> >      >         trouble with submitting MPI jobs. I would like MPI jobs to
> >      >         specify the number of processors they require and have those
> >      >         come from any available physical machine; the users shouldn't
> >      >         need to specify processors per node, etc.
> >      >
> >      >         The torque manual says that the nodes option is mapped to
> >      >         virtual processors, so for example:
> >      >
> >      >              #PBS -l nodes=8
> >      >
> >      >         should request 8 virtual processors. The problem I'm having is
> >      >         that our cluster currently has only 5 physical machines (nodes),
> >      >         and setting nodes to anything greater than 5 gives the error:
> >      >
> >      >              qsub: Job exceeds queue resource limits MSG=cannot locate
> >      >         feasible nodes (nodes file is empty or all systems are busy)
> >      >
> >      >         I'm confused by this: we have 33 virtual processors available
> >      >         across the 5 nodes (4 8-core machines and one single-core), so my
> >      >         interpretation of the manual is that I should be able to request
> >      >         8 nodes, since these should be understood as virtual processors.
> >      >         Am I doing something wrong?
> >      >
> >      >         I tried setting
> >      >
> >      >         #PBS -l procs=8
> >      >
> >      >         but that doesn't seem to do anything; MPI stops due to having
> >      >         only 1 worker available (a single core allocated to the job).
> >      >
> >      >         Thanks,
> >      >         Andrew
> >      >
> >      >         p.s.
> >      >
> >      >         The queue I'm submitting jobs to is defined as:
> >      >
> >      >         create queue normal
> >      >         set queue normal queue_type = Execution
> >      >         set queue normal resources_min.cput = 12:00:00
> >      >         set queue normal resources_default.cput = 24:00:00
> >      >         set queue normal disallowed_types = interactive
> >      >         set queue normal enabled = True
> >      >         set queue normal started = True
> >      >
> >      >         and we are using Torque version 2.5.12 with Maui 3.3.1
> >      >         for scheduling.
> >      >
> >      >
> >
> > --
> > Dr Andrew Dawson
> > Atmospheric, Oceanic & Planetary Physics
> > Clarendon Laboratory
> > Parks Road
> > Oxford OX1 3PU, UK
> > Tel: +44 (0)1865 282438
> > Email: dawson at atm.ox.ac.uk
> > Web Site: http://www2.physics.ox.ac.uk/contacts/people/dawson
> >
> 
> -- 
> Dr Andrew Dawson
> Atmospheric, Oceanic & Planetary Physics
> Clarendon Laboratory
> Parks Road
> Oxford OX1 3PU, UK
> Tel: +44 (0)1865 282438
> Email: dawson at atm.ox.ac.uk
> Web Site: http://www2.physics.ox.ac.uk/contacts/people/dawson
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


