[Mauiusers] Standing Reservation Problem

Stewart.Samuels at sanofi-aventis.com
Fri Feb 10 15:19:34 MST 2006


I seem to be having trouble understanding Maui's behavior. We are running Maui on top of Torque. I have set up queues in Torque and three standing reservations in Maui to direct jobs to a small cluster consisting of one master node and two compute nodes. All nodes have a single CPU and 1 GB of RAM.

The intent of my test is to run prime jobs on mylnxc1-n001 and glide jobs on mylnxc1-n002 at any time. In addition, I would like jobs in the ghts, test, any, and all queues to run at any time on either mylnxc1-n001 or mylnxc1-n002.

However, jobs submitted to the prime or glide queues get stuck in the queue and never execute. Running checkjob shows they are deferred for lack of resources (NoResources), even though nothing is running on the system (see below). Jobs sent to the other queues execute properly. If I comment out the third standing reservation (ghts), the prime and glide jobs execute properly, but jobs in all the other queues now get stuck, with the same message from checkjob. It appears that Maui won't let me map multiple queues onto the same nodes. Is anyone else seeing this behavior?

Is this a consequence of the node policy? I have tried several different node policy options and get the same result with each; changing the policy makes no difference. I also see the same problem with Maui 3.2.6p11 on Torque 1.2.0p1 and with Maui 3.2.6p14 on Torque 2.0.0p4.

I also have the Maui log level set to 9, but the log essentially just confirms the same deferral reported by checkjob. I haven't included it here because of its volume, but I can provide it if required.

Any help would be greatly appreciated.



[root@mylnxc1-a log]# qmgr -c 'p s'
# Create queues and set their attributes.
# Create and define queue glide
create queue glide
set queue glide queue_type = Execution
set queue glide resources_max.nodect = 1
set queue glide enabled = True
set queue glide started = True
# Create and define queue prime
create queue prime
set queue prime queue_type = Execution
set queue prime resources_max.nodect = 1
set queue prime enabled = True
set queue prime started = True
# Create and define queue test
create queue test
set queue test queue_type = Execution
set queue test resources_max.nodect = 2
set queue test enabled = True
set queue test started = True
# Create and define queue ghts
create queue ghts
set queue ghts queue_type = Execution
set queue ghts resources_max.nodect = 2
set queue ghts enabled = True
set queue ghts started = True
# Create and define queue any
create queue any
set queue any queue_type = Execution
set queue any resources_max.nodect = 2
set queue any enabled = True
set queue any started = True
# Create and define queue all
create queue all
set queue all queue_type = Execution
set queue all resources_max.nodect = 2
set queue all enabled = True
set queue all started = True
# Set server attributes.
set server scheduling = True
set server default_queue = ghts
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = False
[root@mylnxc1-a log]#


My maui.cfg:

CLASSCFG[glide]         MAXPROC=1
CLASSCFG[prime]         MAXPROC=1
CLASSCFG[test]          MAXPROC=2
CLASSCFG[ghts]          MAXPROC=2
CLASSCFG[all]           MAXPROC=2
CLASSCFG[any]           MAXPROC=2


SRCFG[prime]    CLASSLIST=prime
SRCFG[prime]    HOSTLIST=mylnxc1-n001

SRCFG[glide]    CLASSLIST=glide
SRCFG[glide]    HOSTLIST=mylnxc1-n002

SRCFG[ghts]     CLASSLIST=ghts,test,any,all
SRCFG[ghts]     HOSTLIST=mylnxc1-n00[1-2]
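For reference, the intended class-to-node mapping expressed by the three SRCFG entries above can be written down as a small model. This is only a hypothetical Python sketch of the intent: the names STANDING_RESERVATIONS, intended_nodes, and reservations_per_host are illustrative, and the HOSTLIST range mylnxc1-n00[1-2] is expanded by hand. It does not reproduce how Maui actually evaluates standing reservations.

```python
# Hypothetical model of the intended class -> node mapping, transcribed by
# hand from the three SRCFG entries in maui.cfg. It illustrates the intent
# only; it does not reproduce Maui's reservation logic.

# reservation name -> (classes admitted by CLASSLIST, hosts from HOSTLIST)
STANDING_RESERVATIONS = {
    "prime": ({"prime"}, {"mylnxc1-n001"}),
    "glide": ({"glide"}, {"mylnxc1-n002"}),
    # HOSTLIST=mylnxc1-n00[1-2] expanded by hand
    "ghts": ({"ghts", "test", "any", "all"}, {"mylnxc1-n001", "mylnxc1-n002"}),
}


def intended_nodes(job_class):
    """Return the set of nodes a job of this class is meant to be able to use."""
    allowed = set()
    for classes, hosts in STANDING_RESERVATIONS.values():
        if job_class in classes:
            allowed |= hosts
    return allowed


def reservations_per_host():
    """Map each host to the names of the standing reservations that cover it."""
    per_host = {}
    for name, (_classes, hosts) in STANDING_RESERVATIONS.items():
        for host in sorted(hosts):
            per_host.setdefault(host, []).append(name)
    return per_host


if __name__ == "__main__":
    for cls in ["prime", "glide", "ghts", "test", "any", "all"]:
        print(f"{cls:6s} -> {sorted(intended_nodes(cls))}")
    for host, names in sorted(reservations_per_host().items()):
        print(f"{host}: covered by {', '.join(names)}")
```

Running it lists each class with the nodes it is meant to reach, and makes the overlap explicit: each single-CPU compute node is covered by two standing reservations (mylnxc1-n001 by prime and ghts, mylnxc1-n002 by glide and ghts).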

[nm67109@mylnxc1-a nm67109]$ checkjob 108

checking job 108

State: Idle  EState: Deferred
Creds:  user:nm67109  group:lgdgis  class:prime  qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Fri Feb 10 17:04:49
  (Time Queued  Total: 00:00:45  Eligible: 00:00:01)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  NoResources  (cannot create reservation for job '108'
 (intital reservation attempt)
Holds:    Defer  (hold reason:  NoResources)
PE:  1.00  StartPriority:  1
cannot select job 108 for partition DEFAULT (job hold active)

[nm67109@mylnxc1-a nm67109]$

               Stewart Samuels
               Infrastructure Evolution and Integration
               Scientific and Medical Affairs 
               Sanofi-Aventis Pharmaceutical              
               1041 Route 202-206			
              Bridgewater, NJ  08807

              Phone:	(908) 231-4762
              Fax:		(908) 231-3488
              email:		Stewart.Samuels at Sanofi-Aventis.com
