[Mauiusers] Routing Queues (was Re: Large queues cause Maui to idle)

Michael Galloway mgx at ornl.gov
Tue Dec 16 06:12:37 MST 2008


On Wed, Dec 10, 2008 at 03:57:02PM -0500, Steve Young wrote:
> I've used a routing queue to solve this problem. The queue that the user is 
> running on can only utilize 32 CPUs. The thousands of jobs are 1 CPU each. 
> So I have this for a routing queue:
>
> create queue physics
> set queue physics queue_type = Route
> set queue physics acl_group_enable = True
> set queue physics route_destinations += herc
> set queue physics enabled = True
> set queue physics started = True
>
> So jobs submitted to physics are moved to the herc execution queue. That 
> queue has the following setting:
>
> set queue herc max_queuable = 36
>
> This way only 36 jobs at a time can be queued from the routing queue, so 
> Maui doesn't have to consider each of the thousands of jobs every 
> iteration. It only has to worry about scheduling jobs for the resources 
> it has to run on.
>
> I also use MAXIJOB in maui:
>
> CLASSCFG[herc]		QLIST=md QDEF=md MAXIJOB=4
>
> This way, even if a user has lots of jobs in the queue, only their top 4 
> idle jobs are considered for scheduling. Others can then get their jobs 
> to run without waiting for Maui to process thousands of jobs that can't 
> run yet anyway.
>
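
For anyone following along, here is a minimal sketch (in Python, purely
illustrative) of the two limits described above. It is a toy model, not how
TORQUE or Maui implement them: route_jobs and eligible_idle_jobs are
hypothetical names, max_queuable is modelled as a simple cap on the number of
jobs sitting in the execution queue, and MAXIJOB as a per-user cap on the idle
jobs the scheduler looks at.

```python
from collections import deque


def route_jobs(routing, execution, max_queuable):
    """Move jobs from the routing queue to the execution queue until the
    execution queue holds max_queuable jobs. Returns the number moved."""
    moved = 0
    while routing and len(execution) < max_queuable:
        execution.append(routing.popleft())
        moved += 1
    return moved


def eligible_idle_jobs(idle_jobs, max_idle_per_user):
    """Keep only the first max_idle_per_user idle jobs of each user, in
    queue order (a simplified stand-in for a per-user idle-job cap)."""
    seen = {}
    eligible = []
    for job_id, user in idle_jobs:
        seen[user] = seen.get(user, 0) + 1
        if seen[user] <= max_idle_per_user:
            eligible.append((job_id, user))
    return eligible


if __name__ == "__main__":
    # Hypothetical workload: 8199 one-CPU jobs from a single user.
    routing = deque((n, "someuser") for n in range(8199))
    execution = deque()
    route_jobs(routing, execution, max_queuable=36)
    print(len(routing), len(execution))
    print(len(eligible_idle_jobs(execution, max_idle_per_user=4)))
```

In this model the scheduler only ever sees the 36 jobs in the execution
queue, and of those only 4 per user count as candidates. Calling route_jobs
again after a job leaves the execution queue pulls in the next one from the
routing queue.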

ok, i'm in this boat as well (lots of serial jobs). i attempted to implement this
thusly:

create queue sroute
set queue sroute queue_type = Route
set queue sroute acl_group_enable = True
set queue sroute route_destinations = serial
set queue sroute route_destinations += serial
set queue sroute enabled = True
set queue sroute started = True

create queue serial
set queue serial queue_type = Execution
set queue serial max_queuable = 36
set queue serial resources_max.walltime = 168:00:00
set queue serial resources_default.neednodes = serial
set queue serial resources_default.nodes = 12
set queue serial resources_default.walltime = 168:00:00
set queue serial enabled = True
set queue serial started = True

jobs are dropping from the routing queue into the serial
queue but not running:

[root@bioinfo server_logs]# qstat -q

server: 

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
annotate           --      --    18:00:00   --    0   0 --   E R
sroute             --      --       --      --    0 8163 --   E R
md                 --      --    168:00:0   --   18   0 --   E R
serial             --      --    168:00:0   --    0  36 --   E R
                                               ----- -----
                                                  18  8199


[root@bioinfo server_logs]# showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

51493                   pkc    Running     1  2:13:33:53  Thu Dec 11 21:41:56
68422                   pkc    Running     1  6:06:22:44  Mon Dec 15 14:30:47
68423                   pkc    Running     1  6:06:22:44  Mon Dec 15 14:30:47

    18 Active Jobs      18 of  260 Processors Active (6.92%)
                         5 of   66 Nodes Active      (7.58%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

68424                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:38:51
68425                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:38:51
68426                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:38:51
68427                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:41:21
68428                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:45:53
68429                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:48:23
68430                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:51:53
68431                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:52:53
68432                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:58:56
68433                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:59:26
68434                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:59:26
68435                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:00:26
68436                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:00:56
68437                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:05:28
68438                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:07:00
68439                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:08:30
68440                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:10:34
68441                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:16:36
68442                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:18:36
68443                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:20:39
68444                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:20:39
68445                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:26:19
68446                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:28:19
68447                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:28:19
68448                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:29:26
68449                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:34:02
68450                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:35:04
68451                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:38:40
68452                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:40:42
68453                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:41:42
68454                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:50:05
68455                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:02:38
68456                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:03:08
68457                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:04:42
68458                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:11:24
68459                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:25:36

Total Jobs: 54   Active Jobs: 18   Idle Jobs: 0   Blocked Jobs: 36

and checkjob on one of the blocked jobs:

[root@bioinfo server_logs]# checkjob 68440
checking job 68440

State: Idle
Creds:  user:bci  group:bci  class:serial  qos:DEFAULT
WallTime: 00:00:00 of 7:00:00:00
SubmitTime: Tue Dec 16 04:10:34
  (Time Queued  Total: 4:00:04  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [serial]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 4
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList: 
  [c0-70:1]
Holds:    Defer  
Messages:  job cannot be started - cannot set hostlist
PE:  1.00  StartPriority:  41
cannot select job 68440 for partition DEFAULT (job hold active)
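
With 36 blocked jobs it may be easier to pull the hold and message lines out
of checkjob output with a script than to read each one by hand. A rough
sketch in Python, assuming the output carries "State:", "Holds:" and
"Messages:" lines in the form shown above; parse_checkjob is a hypothetical
helper, not part of Maui, and it works on text that has already been captured.

```python
def parse_checkjob(text):
    """Extract the job state, active holds and message line from the
    text of a checkjob report. Missing fields stay at their defaults."""
    info = {"state": None, "holds": [], "messages": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("State:"):
            info["state"] = line.split(":", 1)[1].strip()
        elif line.startswith("Holds:"):
            # Several holds may be listed, separated by whitespace.
            info["holds"] = line.split(":", 1)[1].split()
        elif line.startswith("Messages:"):
            info["messages"] = line.split(":", 1)[1].strip()
    return info


if __name__ == "__main__":
    sample = (
        "State: Idle\n"
        "Holds:    Defer  \n"
        "Messages:  job cannot be started - cannot set hostlist\n"
    )
    print(parse_checkjob(sample))
```

Running this over the saved output for each blocked job would show whether
they all carry the same hold and the same message.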

i clearly made an error somewhere, but i just cannot see it. any help
greatly appreciated.

-- michael



