[torqueusers] Toy routing queues not working correctly.

Jeremy Hallum jhallum at umich.edu
Fri Jul 11 06:37:17 MDT 2008


I'm working with some toy routing queues and I don't understand why
they aren't working.  Here are the specs:

Torque 2.3.0

The pbs_server configuration, as printed by qmgr, is below:
create queue small
set queue small queue_type = Execution
set queue small Priority = 20
set queue small resources_max.nodes = 1
set queue small resources_min.nodes = 1
set queue small resources_default.nodes = 1
set queue small enabled = True
set queue small started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = large
set queue default route_destinations += medium
set queue default route_destinations += small
set queue default enabled = True
set queue default started = True
#
# Create and define queue medium
#
create queue medium
set queue medium queue_type = Execution
set queue medium Priority = 20
set queue medium resources_max.nodes = 7
set queue medium resources_min.nodes = 2
set queue medium resources_default.nodes = 4
set queue medium enabled = True
set queue medium started = True
#
# Create and define queue large
#
create queue large
set queue large queue_type = Execution
set queue large Priority = 20
set queue large resources_max.nodes = 32
set queue large resources_min.nodes = 8
set queue large resources_default.nodes = 10
set queue large enabled = True
set queue large started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = xxx.xxx.lsa.umich.edu
set server managers = maui@xxx.xxx.lsa.umich.edu
set server managers += root@xxx.xxx.lsa.umich.edu
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 0
set server queue_centric_limits = True
set server next_job_number = 507

As you can see, it's a really basic routing model.  The problem is that
when a 4-node job is submitted, it lands in the first destination queue,
large, rather than falling through to the medium queue, where it
belongs.

If I disable and stop the first queue (large), the job skips to the
small queue rather than using the medium execution queue.
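
To make the expected behaviour concrete, here is a minimal Python sketch
of the routing described above. It only models the intended outcome,
assuming the node limits are compared as plain integers; it is not how
pbs_server is implemented, and the names Queue and route are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    # Hypothetical model of an execution queue's node limits.
    name: str
    min_nodes: int
    max_nodes: int
    enabled: bool = True


def route(job_nodes, destinations):
    """Return the first enabled queue whose limits admit the job, else None."""
    for queue in destinations:
        if queue.enabled and queue.min_nodes <= job_nodes <= queue.max_nodes:
            return queue.name
    return None


# Destinations in the same order as route_destinations on the default queue.
QUEUES = [
    Queue("large", min_nodes=8, max_nodes=32),
    Queue("medium", min_nodes=2, max_nodes=7),
    Queue("small", min_nodes=1, max_nodes=1),
]

print(route(4, QUEUES))  # medium
```

Under this model a 4-node job is rejected by large (below its minimum of
8) and accepted by medium, and disabling large does not change that
result.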

I've tried using tracejob and raising log_level (to 7) at the server
level to determine what logic the server uses to place the jobs, but
that's not enough; the best information I get is:

07/11/2008 08:04:42  S    enqueuing into default, state 1 hop 1
07/11/2008 08:04:42  S    dequeuing from default, state QUEUED
07/11/2008 08:04:42  S    enqueuing into large, state 1 hop 1
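
When comparing several test submissions, the sequence of queues a job
passed through can be pulled out of log lines like these with a short
script. This is a sketch that assumes the "enqueuing into <queue>,
state N hop N" wording shown in the excerpt above; other Torque versions
may log differently, and the queue_path helper is hypothetical.

```python
import re

# Matches the "enqueuing into <queue>, state N hop N" lines from the excerpt.
ENQUEUE = re.compile(r"enqueuing into (\w+), state (\d+) hop (\d+)")


def queue_path(log_text):
    """Return the queue names a job was enqueued into, in log order."""
    return [match.group(1) for match in ENQUEUE.finditer(log_text)]


LOG = """\
07/11/2008 08:04:42  S    enqueuing into default, state 1 hop 1
07/11/2008 08:04:42  S    dequeuing from default, state QUEUED
07/11/2008 08:04:42  S    enqueuing into large, state 1 hop 1
"""

print(queue_path(LOG))  # ['default', 'large']
```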

Has anyone else seen a problem like this? What other steps can I take to
try to diagnose the problem?  I've tried:

- Recreating the entire pbs_server database (pbs_server -t create).
- Flipping the order of the queues (whichever queue is listed first
  always gets the job).
- Switching schedulers. I started with Maui and later switched to
  pbs_sched, and it still isn't working right, which is why I suspect
  a setting in qmgr is the culprit.

Thanks for any help you can give, and let me know if you need more info.

-jeremy


-- 
Jeremy Hallum
System Administrator, Research Systems Group
LSA Information Technology
University of Michigan
jhallum at umich.edu


