[torqueusers] Torque over-limit submission accepted.

Martin Siegert siegert at sfu.ca
Tue Sep 6 12:39:12 MDT 2011


Hi Marc,

AFAIK ncpus is not a resource interpreted by torque itself; it is a
relic from ancient history (SMP machines, not clusters) and in my
experience is usually interpreted like nodes=1:ppn=x by the scheduler
(I do not know what maui does with it).

Thus, I doubt that ncpus will have any effect on how jobs that
request a nodes resource in the form nodes=x:ppn=y are routed.
E.g., the job with -l nodes=14:ppn=12 is routed to the medium queue
because of the
set queue small resources_max.nodect = 12
setting: the job requests 14 nodes. None of the resources_min.ncpus
and/or resources_max.ncpus settings come into play, since none of your
jobs request -l ncpus=x. And if they did, those requests would be
interpreted in a way that has little to do with the user's intention.
For that reason I prevent accidental use of ncpus on my clusters
through
set server resources_max.ncpus = 0
which causes jobs with an ncpus specification to be rejected right away.
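
For reference, that guard can be applied to a running server with a
qmgr one-liner like the following (just a sketch; it is the same
setting quoted above, and you can check the result with
qmgr -c "print server"):

qmgr -c "set server resources_max.ncpus = 0"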

In short: do not use ncpus.
As far as I understand you want to route your jobs depending on the
number of requested processors (cores). Since torque-2.5.6 you can
use the procct resource to configure this; torque determines procct
from the nodes and/or procs specification of the job: requests of
the form -l nodes=x:ppn=y and/or -l procs=z result in procct=x*y+z,
e.g., -l nodes=2:ppn=12 results in procct=24. Thus, you probably want
to set

set queue small resources_max.procct = 12
set queue medium resources_min.procct = 13
set queue medium resources_max.procct = 64
set queue large resources_min.procct = 65
set queue large resources_max.procct = 168

and remove all ncpus (and possibly nodect) specifications in qmgr.
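
As a quick sanity check of those limits (a sketch only, using the
queue names from your configuration, and assuming a plain procs
request carries no nodes specification), example requests would be
counted and routed like this:

-l nodes=1:ppn=12    -> procct = 12  -> small
-l nodes=2:ppn=12    -> procct = 24  -> medium
-l procs=64          -> procct = 64  -> medium
-l nodes=14:ppn=12   -> procct = 168 -> large

The corresponding qmgr commands could look like this (adjust to your
site and verify afterwards with qmgr -c "print server"):

qmgr -c "set queue small resources_max.procct = 12"
qmgr -c "set queue medium resources_min.procct = 13"
qmgr -c "set queue medium resources_max.procct = 64"
qmgr -c "set queue large resources_min.procct = 65"
qmgr -c "set queue large resources_max.procct = 168"
qmgr -c "unset queue small resources_max.ncpus"
qmgr -c "unset queue small resources_max.nodect"
qmgr -c "unset queue medium resources_min.ncpus"
qmgr -c "unset queue medium resources_max.ncpus"
qmgr -c "unset queue medium resources_max.nodect"
qmgr -c "unset queue large resources_min.ncpus"
qmgr -c "unset queue large resources_max.ncpus"
qmgr -c "unset queue large resources_max.nodect"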

Cheers,
Martin

-- 
Martin Siegert
Simon Fraser University

On Mon, Sep 05, 2011 at 09:45:15PM +0200, Marc Mendez-Bermond wrote:
> Hi all,
> 
> -= Sorry if repost, but I haven't seen my message appear on the list =-
> 
> I am fighting with a *new* installation of Torque coupled with Maui
> where 3 queues are defined, and when I try to submit a job to the "small"
> queue with more cores than its maximum allows, the job is accepted.
> 
> For example, the small queue has 'resources_max.ncpus = 12', yet it accepts
> '-l nodes=2:ppn=12 -q small' requests ... It looks like only the nodes value
> is considered, which seems confirmed when I try the following:
> '-l nodes=14:ppn=12' is routed to the "medium" queue, which is defined with
> 'set queue medium resources_max.ncpus = 64'.
> 
> Versions are :
> - torque-2.5.7-1.el5.1 (EPEL5 RPMs for RHEL/CENTOS 5)
> - maui-3.3-4.el5 (https://svnweb.cern.ch/trac/maui)
> 
> The configuration is detailed below, and I think Maui can be ruled out as
> the cause, since using pbs_sched leads to the same issue.
> 
> Any help is appreciated!
> 
> Regards,
> M.
> 
> ======
> 
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue medium
> #
> create queue medium
> set queue medium queue_type = Execution
> set queue medium max_queuable = 100
> set queue medium resources_max.ncpus = 64
> set queue medium resources_max.nodect = 64
> set queue medium resources_min.ncpus = 13
> set queue medium resources_default.walltime = 48:00:00
> set queue medium enabled = True
> set queue medium started = True
> #
> # Create and define queue large
> #
> create queue large
> set queue large queue_type = Execution
> set queue large max_queuable = 100
> set queue large resources_max.ncpus = 168
> set queue large resources_max.nodect = 168
> set queue large resources_min.ncpus = 65
> set queue large resources_default.walltime = 24:00:00
> set queue large enabled = True
> set queue large started = True
> #
> # Create and define queue small
> #
> create queue small
> set queue small queue_type = Execution
> set queue small max_queuable = 100
> set queue small resources_max.ncpus = 12
> set queue small resources_max.nodect = 12
> set queue small resources_default.walltime = 96:00:00
> set queue small enabled = True
> set queue small started = True
> #
> # Create and define queue portalq
> #
> create queue portalq
> set queue portalq queue_type = Route
> set queue portalq route_destinations = small
> set queue portalq route_destinations += medium
> set queue portalq route_destinations += large
> set queue portalq enabled = True
> set queue portalq started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = master.mycluster.org
> set server managers = root at master.mycluster.org
> set server operators = root at master.mycluster.org
> set server default_queue = portalq
> set server log_events = 511
> set server mail_from = adm
> set server resources_default.nodect = 1
> set server resources_default.walltime = 00:15:00
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server queue_centric_limits = True
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 183

