[Mauiusers] Routing Queues (was Re Large queues cause Maui to idle)

Steve Young chemadm at hamilton.edu
Tue Dec 16 07:27:37 MST 2008


Hi Michael,
	First I would try submitting some jobs directly to the execution queue
to make sure it works. Since the message says "job cannot be started -
cannot set hostlist", I'm wondering whether your server_priv/nodes file
actually lists "serial" as a feature on a certain number of nodes for
this queue. Another thing I wonder is what the batch script for the job
looks like. Is the user using -l host=<name of node> in it? I'm not
certain what the message is supposed to mean, but it sounds like it
isn't able to find any nodes to allocate the job to. Hope this helps,
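
One way to check the nodes-file point is sketched below. It is only an
illustration: the node names, np values and sample file contents are made
up, and nodes_with_feature is a hypothetical helper, not a Torque or Maui
command. On a real system you would point it at your own
server_priv/nodes file instead of the temporary sample.

```shell
#!/bin/sh
# Sketch: list the nodes in a Torque-style nodes file that carry a feature.
# Each line is assumed to look like: <nodename> [np=N] [feature ...]

# Print the name of every node whose line lists FEATURE as a whole word.
nodes_with_feature() {
    feature=$1
    file=$2
    awk -v f="$feature" '
        NF == 0 { next }                      # skip blank lines
        {
            # Field 1 is the node name; later fields are np=N or features.
            for (i = 2; i <= NF; i++)
                if ($i == f) { print $1; break }
        }
    ' "$file"
}

# Hypothetical sample data standing in for server_priv/nodes.
NODES_FILE=$(mktemp)
cat > "$NODES_FILE" <<'EOF'
node01 np=4 serial
node02 np=4 serial bigmem
node03 np=8 parallel
EOF

nodes_with_feature serial "$NODES_FILE"
rm -f "$NODES_FILE"
```

If this prints nothing for "serial" against the real file, the queue's
resources_default.neednodes = serial has no nodes to match. Piping a
trivial script to qsub -q serial is one way to test the execution queue
directly, bypassing the routing queue.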

-Steve

On Dec 16, 2008, at 8:12 AM, Michael Galloway wrote:

>
> On Wed, Dec 10, 2008 at 03:57:02PM -0500, Steve Young wrote:
>> I've used a routing queue to solve this problem. The queue that the user
>> is running on can only utilize 32 CPUs. The thousands of jobs are 1 CPU
>> each. So I have this for a routing queue:
>>
>> create queue physics
>> set queue physics queue_type = Route
>> set queue physics acl_group_enable = True
>> set queue physics route_destinations += herc
>> set queue physics enabled = True
>> set queue physics started = True
>>
>> So jobs that go into here are moved to the herc execution queue. This
>> queue has the following setting:
>>
>> set queue herc max_queuable = 36
>>
>> This way only 36 jobs at a time can be queued from the routing queue, so
>> Maui doesn't have to consider all of the thousands of jobs on each
>> iteration. It only has to worry about scheduling the jobs for the
>> resources it has to run on.
>>
>> I also use MAXIJOB in maui:
>>
>> CLASSCFG[herc]		QLIST=md QDEF=md MAXIJOB=4
>>
>> This way, even if a user has lots of jobs in the queue, only their top 4
>> idle jobs will get considered for scheduling. Others will be able to get
>> their jobs to run without having to wait for Maui to process thousands of
>> jobs that can't run yet anyhow.
>>
>
> ok, i'm in this boat as well (lots of serial jobs). i attempted to
> implement this thusly:
>
> create queue sroute
> set queue sroute queue_type = Route
> set queue sroute acl_group_enable = True
> set queue sroute route_destinations = serial
> set queue sroute route_destinations += serial
> set queue sroute enabled = True
> set queue sroute started = True
>
> create queue serial
> set queue serial queue_type = Execution
> set queue serial max_queuable = 36
> set queue serial resources_max.walltime = 168:00:00
> set queue serial resources_default.neednodes = serial
> set queue serial resources_default.nodes = 12
> set queue serial resources_default.walltime = 168:00:00
> set queue serial enabled = True
> set queue serial started = True
>
> jobs are dropping from the routing queue into the serial
> queue but not running:
>
> [root at bioinfo server_logs]# qstat -q
>
> server:
>
> Queue            Memory CPU Time Walltime Node  Run Que Lm  State
> ---------------- ------ -------- -------- ----  --- --- --  -----
> annotate           --      --    18:00:00   --    0   0 --   E R
> sroute             --      --       --      --    0 8163 --   E R
> md                 --      --    168:00:0   --   18   0 --   E R
> serial             --      --    168:00:0   --    0  36 --   E R
>                                               ----- -----
>                                                  18  8199
>
>
> [root at bioinfo server_logs]# showq
> ACTIVE JOBS--------------------
> JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
>
> 51493                   pkc    Running     1  2:13:33:53  Thu Dec 11 21:41:56
> 68422                   pkc    Running     1  6:06:22:44  Mon Dec 15 14:30:47
> 68423                   pkc    Running     1  6:06:22:44  Mon Dec 15 14:30:47
>
>    18 Active Jobs      18 of  260 Processors Active (6.92%)
>                         5 of   66 Nodes Active      (7.58%)
>
> IDLE JOBS----------------------
> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>
>
> 0 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>
> 68424                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:38:51
> 68425                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:38:51
> 68426                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:38:51
> 68427                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:41:21
> 68428                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:45:53
> 68429                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:48:23
> 68430                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:51:53
> 68431                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:52:53
> 68432                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:58:56
> 68433                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:59:26
> 68434                   bci       Idle     1  7:00:00:00  Tue Dec 16 03:59:26
> 68435                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:00:26
> 68436                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:00:56
> 68437                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:05:28
> 68438                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:07:00
> 68439                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:08:30
> 68440                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:10:34
> 68441                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:16:36
> 68442                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:18:36
> 68443                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:20:39
> 68444                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:20:39
> 68445                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:26:19
> 68446                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:28:19
> 68447                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:28:19
> 68448                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:29:26
> 68449                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:34:02
> 68450                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:35:04
> 68451                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:38:40
> 68452                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:40:42
> 68453                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:41:42
> 68454                   bci       Idle     1  7:00:00:00  Tue Dec 16 04:50:05
> 68455                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:02:38
> 68456                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:03:08
> 68457                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:04:42
> 68458                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:11:24
> 68459                   bci       Idle     1  7:00:00:00  Tue Dec 16 05:25:36
>
> Total Jobs: 54   Active Jobs: 18   Idle Jobs: 0   Blocked Jobs: 36
>
> and checkjob on one of the blocked jobs:
>
> [root at bioinfo server_logs]# checkjob 68440
> checking job 68440
>
> State: Idle
> Creds:  user:bci  group:bci  class:serial  qos:DEFAULT
> WallTime: 00:00:00 of 7:00:00:00
> SubmitTime: Tue Dec 16 04:10:34
>  (Time Queued  Total: 4:00:04  Eligible: 00:00:00)
>
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [serial]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 4
> PartitionMask: [ALL]
> Flags:       HOSTLIST RESTARTABLE
> HostList:
>  [c0-70:1]
> Holds:    Defer
> Messages:  job cannot be started - cannot set hostlist
> PE:  1.00  StartPriority:  41
> cannot select job 68440 for partition DEFAULT (job hold active)
>
> i clearly made an error somewhere, just cannot see it. any help
> greatly appreciated.
>
> -- michael
>
>


