[Mauiusers] Standing Reservation Problem

Stewart.Samuels at sanofi-aventis.com Stewart.Samuels at sanofi-aventis.com
Mon Feb 13 11:26:38 MST 2006


Hello Mauiusers,
 
Following more testing, I find that Maui does not seem to accept two or more standing reservations that share common subsets of nodes.  This is a major problem if one needs, for instance, to set up access to queues at different QOS levels on nodes that are shared with other standing reservations.
 
For example, changing the srcfg configuration from my previous message (listed below) to:
 
SRCFG[prime]    CLASSLIST=prime,ghts,test,any,all
SRCFG[prime]    PERIOD=INFINITY 
SRCFG[prime]    HOSTLIST=mylnxc1-n001 

SRCFG[glide]    CLASSLIST=glide,ghts,test,any,all
SRCFG[glide]    PERIOD=INFINITY 
SRCFG[glide]    HOSTLIST=mylnxc1-n002 

#SRCFG[ghts]     CLASSLIST=ghts,test,any,all 
#SRCFG[ghts]     PERIOD=INFINITY 
#SRCFG[ghts]     HOSTLIST=mylnxc1-n00[1-2] 

works fine.  I can submit to all queues, with prime jobs going only to node mylnxc1-n001, glide jobs going only to mylnxc1-n002, and all other jobs going to either node.  However, this applies every QOSLIST entry in an SRCFG to every CLASSLIST entry of that SRCFG, whereas what I really want is to map specific QOSLIST entries to specific CLASSLIST entries on specific nodes, using multiple SRCFGs as necessary.
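For what it's worth, one approach that might give per-class QOS without putting QOSLIST on the reservations at all is to bind a default QOS to each class via CLASSCFG (QDEF is a documented Maui class attribute; the QOS names below are hypothetical, and this is a sketch rather than a tested configuration):

```
# Hypothetical QOS definitions
QOSCFG[primeqos]  PRIORITY=100
QOSCFG[glideqos]  PRIORITY=50

# Bind a default QOS to each class instead of listing QOS on the SR
CLASSCFG[prime]   QDEF=primeqos
CLASSCFG[glide]   QDEF=glideqos

# Reservations then only control per-class node access
SRCFG[prime]    CLASSLIST=prime,ghts,test,any,all
SRCFG[prime]    PERIOD=INFINITY
SRCFG[prime]    HOSTLIST=mylnxc1-n001

SRCFG[glide]    CLASSLIST=glide,ghts,test,any,all
SRCFG[glide]    PERIOD=INFINITY
SRCFG[glide]    HOSTLIST=mylnxc1-n002
```

With this split, the QOS follows the class regardless of which reservation admits the job, so the reservations no longer need overlapping QOSLIST entries.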

Is anyone doing this successfully?  If so, I would appreciate any help you can provide.

        Stewart

-----Original Message-----
From: mauiusers-bounces at supercluster.org [mailto:mauiusers-bounces at supercluster.org]On Behalf Of Stewart.Samuels at sanofi-aventis.com
Sent: Friday, February 10, 2006 5:20 PM
To: mauiusers at supercluster.org
Subject: [Mauiusers] Standing Reservation Problem



Mauiusers, 

I seem to be having trouble understanding the behavior of Maui.  We are running Maui on Torque.  I have set up queues via Torque and two Standing Reservations via Maui to direct jobs to a small cluster containing 1 Master node and 2 compute nodes.  All nodes have a single cpu and 1 GB of RAM.

The intent of my test is to execute prime jobs on mylnxc1-n001 and glide jobs on mylnxc1-n002 at any time.  Additionally, I would like to run ghts, test, any, and all jobs at any time on either node mylnxc1-n001 or mylnxc1-n002.  However, when submitting jobs to the prime or glide queues, they get stuck in the queue and never execute.  Checkjob shows they are waiting for resources, yet nothing is running on the system (see below).  Jobs sent to the other queues execute properly.  If I comment out the third standing reservation, the prime and glide jobs execute properly, but all other jobs now get stuck in their queues with the same message from checkjob.  It would appear that Maui won't let me map multiple queues onto the same nodes.  Is anyone else experiencing this behavior?

Is this a function of policy?  I've tried several different node policy options, all with the same result; it doesn't seem to matter whether I change them or not.  I have the same problem with Maui 3.2.6p11 on Torque 1.2.0p1 and with Maui 3.2.6p14 on Torque 2.0.0p4.

I also have the Maui log level set to 9, but it essentially confirms the same deferred message as checkjob.  I haven't included it in this set of data because of the volume, but I can provide it if required.
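As a debugging aid (a suggestion, not from the original thread), the standard Maui client commands can show how each standing reservation actually maps onto the nodes, which may reveal the conflict more directly than the job-level deferral message:

```
# List all reservations currently known to Maui
showres

# Show reservations grouped per node
showres -n

# Detailed reservation diagnostics (ACLs, hostlists, conflicts)
diagnose -r

# Per-node state, including reservation access on a given host
checknode mylnxc1-n001
```

These commands require a running Maui scheduler, so the output naturally depends on the live configuration.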

Any help would be greatly appreciated.   

               Stewart Samuels 
               Infrastructure Evolution and Integration 
               Scientific and Medical Affairs 
               Sanofi-Aventis Pharmaceutical              
               1041 Route 202-206                       
              Bridgewater, NJ  08807 

              Phone:    (908) 231-4762 
              Fax:              (908) 231-3488 
              email:            Stewart.Samuels at Sanofi-Aventis.com 

 


--------------------------------------------------------------------------------------------- 

[root at mylnxc1-a log]# qmgr -c 'p s' 
# 
# Create queues and set their attributes. 
# 
# 
# Create and define queue glide 
# 
create queue glide 
set queue glide queue_type = Execution 
set queue glide resources_max.nodect = 1 
set queue glide enabled = True 
set queue glide started = True 
# 
# Create and define queue prime 
# 
create queue prime 
set queue prime queue_type = Execution 
set queue prime resources_max.nodect = 1 
set queue prime enabled = True 
set queue prime started = True 
# 
# Create and define queue test 
# 
create queue test 
set queue test queue_type = Execution 
set queue test resources_max.nodect = 2 
set queue test enabled = True 
set queue test started = True 
# 
# Create and define queue ghts 
# 
create queue ghts 
set queue ghts queue_type = Execution 
set queue ghts resources_max.nodect = 2 
set queue ghts enabled = True 
set queue ghts started = True 
# 
# Create and define queue any 
# 
create queue any 
set queue any queue_type = Execution 
set queue any resources_max.nodect = 2 
set queue any enabled = True 
set queue any started = True 
# 
# Create and define queue all 
# 
create queue all 
set queue all queue_type = Execution 
set queue all resources_max.nodect = 2 
set queue all enabled = True 
set queue all started = True 
# 
# Set server attributes. 
# 
set server scheduling = True 
set server default_queue = ghts 
set server log_events = 511 
set server mail_from = adm 
set server query_other_jobs = True 
set server resources_default.neednodes = 1 
set server resources_default.nodect = 1 
set server resources_default.nodes = 1 
set server scheduler_iteration = 600 
set server node_ping_rate = 300 
set server node_check_rate = 600 
set server tcp_timeout = 6 
set server node_pack = False 
[root at mylnxc1-a log]# 


------------------------------------------------------------------------------------------ 

[root at mylnxc1-a log]# My maui.cfg 

QUEUETIMEWEIGHT       10 

BACKFILLPOLICY        FIRSTFIT 
RESERVATIONPOLICY     CURRENTHIGHEST 

#NODEALLOCATIONPOLICY    MINRESOURCE 
JOBNODEMATCHPOLICY      EXACTNODE 
NODEACCESSPOLICY        SHARED 

CLASSCFG[glide]         MAXPROC=1 
CLASSCFG[prime]         MAXPROC=1 
CLASSCFG[test]          MAXPROC=2 
CLASSCFG[ghts]          MAXPROC=2 
CLASSCFG[all]           MAXPROC=2 
CLASSCFG[any]           MAXPROC=2 

CREDWEIGHT      1 
CLASSWEIGHT     1 
QOSWEIGHT       1 
XFACTORWEIGHT   1 

SRCFG[prime]    CLASSLIST=prime 
SRCFG[prime]    PERIOD=INFINITY 
SRCFG[prime]    HOSTLIST=mylnxc1-n001 

SRCFG[glide]    CLASSLIST=glide 
SRCFG[glide]    PERIOD=INFINITY 
SRCFG[glide]    HOSTLIST=mylnxc1-n002 

SRCFG[ghts]     CLASSLIST=ghts,test,any,all 
SRCFG[ghts]     PERIOD=INFINITY 
SRCFG[ghts]     HOSTLIST=mylnxc1-n00[1-2] 

[nm67109 at mylnxc1-a nm67109]$ checkjob 108 


checking job 108 

State: Idle  EState: Deferred 
Creds:  user:nm67109  group:lgdgis  class:prime  qos:DEFAULT 
WallTime: 00:00:00 of 99:23:59:59 
SubmitTime: Fri Feb 10 17:04:49 
  (Time Queued  Total: 00:00:45  Eligible: 00:00:01) 

Total Tasks: 1 

Req[0]  TaskCount: 1  Partition: ALL 
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0 
Opsys: [NONE]  Arch: [NONE]  Features: [NONE] 


IWD: [NONE]  Executable:  [NONE] 
Bypass: 0  StartCount: 0 
PartitionMask: [ALL] 
Flags:       RESTARTABLE 

job is deferred.  Reason:  NoResources  (cannot create reservation for job '108' 
 (intital reservation attempt) 
) 
Holds:    Defer  (hold reason:  NoResources) 
PE:  1.00  StartPriority:  1 
cannot select job 108 for partition DEFAULT (job hold active) 


[nm67109 at mylnxc1-a nm67109]$ 



