[Mauiusers] Routing Queues (was Re Large queues cause Maui to idle)
Steve Young
chemadm at hamilton.edu
Tue Dec 16 07:27:37 MST 2008
Hi Michael,
First, I would try submitting some jobs directly to the execution queue
to make sure it works. Since checkjob says "job cannot be started -
cannot set hostlist", I wonder whether your server_priv/nodes file
actually lists "serial" as a feature on the nodes meant for this queue.
I also wonder what the batch script for the job looks like: is the user
requesting -l host=<name of node> in it? I'm not certain what the
message is supposed to mean, but it sounds like Maui isn't able to find
any nodes to allocate the job to. Hope this helps,
-Steve
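
The nodes-file check suggested above can be sketched as a small shell
script. This is only a sketch under assumptions: the temporary file
stands in for server_priv/nodes, the np= values and every node name
other than c0-70 are invented for illustration, and nodes_with_feature
is a hypothetical helper, not a Torque command.

```shell
#!/bin/sh
# Hypothetical stand-in for the server_priv/nodes file. Only c0-70 appears
# in the thread; the other node names and all np= values are made up.
nodes_file=$(mktemp)
cat > "$nodes_file" <<'EOF'
# name  settings  features
c0-70 np=4 serial
c0-71 np=4 serial
c0-72 np=4 md
EOF

# Print the name of every node whose line carries the given feature.
# Fields after the node name are key=value settings (np=4) or bare features.
nodes_with_feature() {
    awk -v feat="$1" '
        NF == 0 || $1 ~ /^#/ { next }   # skip blank lines and comments
        { for (i = 2; i <= NF; i++) if ($i == feat) { print $1; break } }
    ' "$2"
}

serial_nodes=$(nodes_with_feature serial "$nodes_file")
echo "$serial_nodes"
rm -f "$nodes_file"
```

On a real server the second argument would be the actual nodes file
under the Torque home directory. If the node named in the job's
HostList is not printed, the "serial" feature is probably missing for
that node.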
On Dec 16, 2008, at 8:12 AM, Michael Galloway wrote:
>
> On Wed, Dec 10, 2008 at 03:57:02PM -0500, Steve Young wrote:
>> I've used a routing queue to solve this problem. The queue that the
>> user is running on can only utilize 32 CPUs. The thousands of jobs
>> are 1 CPU each. So I have this for a routing queue:
>>
>> create queue physics
>> set queue physics queue_type = Route
>> set queue physics acl_group_enable = True
>> set queue physics route_destinations += herc
>> set queue physics enabled = True
>> set queue physics started = True
>>
>> So jobs that go into here are moved to the herc execution queue.
>> This queue has the following setting:
>>
>> set queue herc max_queuable = 36
>>
>> This way only 36 jobs at a time can be queued from the routing
>> queue, so Maui doesn't have to consider each of the thousands of
>> jobs every iteration. It only has to worry about scheduling the
>> jobs for the resources it has to run on.
>>
>> I also use MAXIJOB in maui:
>>
>> CLASSCFG[herc] QLIST=md QDEF=md MAXIJOB=4
>>
>> This way, even if a user has lots of jobs in the queue, only their
>> top 4 idle jobs will be considered for scheduling. Others will be
>> able to get their jobs to run without having to wait for Maui to
>> process thousands of jobs that can't run yet anyhow.
>>
>
> ok, i'm in this boat as well (lots of serial jobs). i attempted to
> implement this thusly:
>
> create queue sroute
> set queue sroute queue_type = Route
> set queue sroute acl_group_enable = True
> set queue sroute route_destinations = serial
> set queue sroute route_destinations += serial
> set queue sroute enabled = True
> set queue sroute started = True
>
> create queue serial
> set queue serial queue_type = Execution
> set queue serial max_queuable = 36
> set queue serial resources_max.walltime = 168:00:00
> set queue serial resources_default.neednodes = serial
> set queue serial resources_default.nodes = 12
> set queue serial resources_default.walltime = 168:00:00
> set queue serial enabled = True
> set queue serial started = True
>
> jobs are dropping from the routing queue into the serial
> queue but not running:
>
> [root@bioinfo server_logs]# qstat -q
>
> server:
>
> Queue            Memory CPU Time Walltime Node   Run   Que Lm State
> ---------------- ------ -------- -------- ---- ----- ----- -- -----
> annotate           --      --    18:00:00  --      0     0 -- E R
> sroute             --      --       --     --      0  8163 -- E R
> md                 --      --    168:00:0  --     18     0 -- E R
> serial             --      --    168:00:0  --      0    36 -- E R
>                                                ----- -----
>                                                   18  8199
>
>
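
As a quick sanity check that the routing setup holds jobs back as
intended, the Que column of the qstat -q output above can be compared
per queue. This is a minimal sketch, assuming the rows are pasted in as
shown (on a live system they would come from qstat -q itself);
queued_in is a hypothetical helper name.

```shell
#!/bin/sh
# Queue rows copied from the qstat -q output above.
qstat_rows=$(cat <<'EOF'
annotate -- -- 18:00:00 -- 0 0 -- E R
sroute -- -- -- -- 0 8163 -- E R
md -- -- 168:00:0 -- 18 0 -- E R
serial -- -- 168:00:0 -- 0 36 -- E R
EOF
)

# Field 7 of each row is the queued-job count (the "Que" column).
queued_in() {
    printf '%s\n' "$qstat_rows" | awk -v q="$1" '$1 == q { print $7 }'
}

held_back=$(queued_in sroute)   # jobs still waiting in the routing queue
visible=$(queued_in serial)     # jobs already moved to the execution queue
echo "held in routing queue: $held_back, queued in execution queue: $visible"
```

With the numbers above this reports 8163 jobs waiting in sroute and 36
in serial, which matches max_queuable = 36. The routing part therefore
appears to work, and the problem lies in starting the jobs.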
> [root@bioinfo server_logs]# showq
> ACTIVE JOBS--------------------
> JOBNAME  USERNAME  STATE    PROC  REMAINING   STARTTIME
>
> 51493    pkc       Running     1  2:13:33:53  Thu Dec 11 21:41:56
> 68422    pkc       Running     1  6:06:22:44  Mon Dec 15 14:30:47
> 68423    pkc       Running     1  6:06:22:44  Mon Dec 15 14:30:47
>
> 18 Active Jobs 18 of 260 Processors Active (6.92%)
> 5 of 66 Nodes Active (7.58%)
>
> IDLE JOBS----------------------
> JOBNAME  USERNAME  STATE    PROC  WCLIMIT     QUEUETIME
>
>
> 0 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME  USERNAME  STATE    PROC  WCLIMIT     QUEUETIME
>
> 68424    bci       Idle        1  7:00:00:00  Tue Dec 16 03:38:51
> 68425    bci       Idle        1  7:00:00:00  Tue Dec 16 03:38:51
> 68426    bci       Idle        1  7:00:00:00  Tue Dec 16 03:38:51
> 68427    bci       Idle        1  7:00:00:00  Tue Dec 16 03:41:21
> 68428    bci       Idle        1  7:00:00:00  Tue Dec 16 03:45:53
> 68429    bci       Idle        1  7:00:00:00  Tue Dec 16 03:48:23
> 68430    bci       Idle        1  7:00:00:00  Tue Dec 16 03:51:53
> 68431    bci       Idle        1  7:00:00:00  Tue Dec 16 03:52:53
> 68432    bci       Idle        1  7:00:00:00  Tue Dec 16 03:58:56
> 68433    bci       Idle        1  7:00:00:00  Tue Dec 16 03:59:26
> 68434    bci       Idle        1  7:00:00:00  Tue Dec 16 03:59:26
> 68435    bci       Idle        1  7:00:00:00  Tue Dec 16 04:00:26
> 68436    bci       Idle        1  7:00:00:00  Tue Dec 16 04:00:56
> 68437    bci       Idle        1  7:00:00:00  Tue Dec 16 04:05:28
> 68438    bci       Idle        1  7:00:00:00  Tue Dec 16 04:07:00
> 68439    bci       Idle        1  7:00:00:00  Tue Dec 16 04:08:30
> 68440    bci       Idle        1  7:00:00:00  Tue Dec 16 04:10:34
> 68441    bci       Idle        1  7:00:00:00  Tue Dec 16 04:16:36
> 68442    bci       Idle        1  7:00:00:00  Tue Dec 16 04:18:36
> 68443    bci       Idle        1  7:00:00:00  Tue Dec 16 04:20:39
> 68444    bci       Idle        1  7:00:00:00  Tue Dec 16 04:20:39
> 68445    bci       Idle        1  7:00:00:00  Tue Dec 16 04:26:19
> 68446    bci       Idle        1  7:00:00:00  Tue Dec 16 04:28:19
> 68447    bci       Idle        1  7:00:00:00  Tue Dec 16 04:28:19
> 68448    bci       Idle        1  7:00:00:00  Tue Dec 16 04:29:26
> 68449    bci       Idle        1  7:00:00:00  Tue Dec 16 04:34:02
> 68450    bci       Idle        1  7:00:00:00  Tue Dec 16 04:35:04
> 68451    bci       Idle        1  7:00:00:00  Tue Dec 16 04:38:40
> 68452    bci       Idle        1  7:00:00:00  Tue Dec 16 04:40:42
> 68453    bci       Idle        1  7:00:00:00  Tue Dec 16 04:41:42
> 68454    bci       Idle        1  7:00:00:00  Tue Dec 16 04:50:05
> 68455    bci       Idle        1  7:00:00:00  Tue Dec 16 05:02:38
> 68456    bci       Idle        1  7:00:00:00  Tue Dec 16 05:03:08
> 68457    bci       Idle        1  7:00:00:00  Tue Dec 16 05:04:42
> 68458    bci       Idle        1  7:00:00:00  Tue Dec 16 05:11:24
> 68459    bci       Idle        1  7:00:00:00  Tue Dec 16 05:25:36
>
> Total Jobs: 54 Active Jobs: 18 Idle Jobs: 0 Blocked Jobs: 36
>
> and checkjob on one of the blocked jobs:
>
> [root@bioinfo server_logs]# checkjob 68440
> checking job 68440
>
> State: Idle
> Creds: user:bci group:bci class:serial qos:DEFAULT
> WallTime: 00:00:00 of 7:00:00:00
> SubmitTime: Tue Dec 16 04:10:34
> (Time Queued Total: 4:00:04 Eligible: 00:00:00)
>
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [serial]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 4
> PartitionMask: [ALL]
> Flags: HOSTLIST RESTARTABLE
> HostList:
> [c0-70:1]
> Holds: Defer
> Messages: job cannot be started - cannot set hostlist
> PE: 1.00 StartPriority: 41
> cannot select job 68440 for partition DEFAULT (job hold active)
>
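
With 36 blocked jobs, it can help to reduce each checkjob report to the
few fields that explain the block. A minimal sketch, assuming the
output layout shown above (HostList value on the line after the
"HostList:" label); summarize_checkjob is a hypothetical helper name,
not a Maui command.

```shell
#!/bin/sh
# Sample input copied from the checkjob output quoted above. On a live
# system the helper would instead be fed by: checkjob 68440 | summarize_checkjob
sample=$(cat <<'EOF'
checking job 68440
State: Idle
Flags: HOSTLIST RESTARTABLE
HostList:
[c0-70:1]
Holds: Defer
Messages: job cannot be started - cannot set hostlist
EOF
)

# Keep only the job id, the hold type, the requested host list and the message.
summarize_checkjob() {
    awk '
        want_host       { host = $1; want_host = 0; next }  # line after "HostList:"
        /^checking job/ { job = $3 }
        /^HostList:/    { want_host = 1 }
        /^Holds:/       { holds = $2 }
        /^Messages:/    { sub(/^Messages:[ \t]*/, ""); msg = $0 }
        END { printf "job=%s holds=%s hostlist=%s msg=\"%s\"\n", job, holds, host, msg }
    '
}

summary=$(printf '%s\n' "$sample" | summarize_checkjob)
echo "$summary"
```

Running the helper over every blocked job id from showq would show
whether all of them are pinned to the same host or spread across nodes,
which narrows down whether one node or the queue definition is at
fault.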
> i clearly made an error somewhere, just cannot see it. any help
> greatly appreciated.
>
> -- michael