[Mauiusers] Standing Reservation Problem
Stewart.Samuels at sanofi-aventis.com
Mon Feb 13 11:26:38 MST 2006
Hello Mauiusers,
After more testing, I find that Maui does not seem to handle two or more standing reservations whose host lists share nodes. This is a major problem if one needs, for instance, to set up access to queues using different QOS levels on nodes that are shared with other standing reservations.
For example, changing the SRCFG configuration from my previous message (quoted below) to:
SRCFG[prime] CLASSLIST=prime,ghts,test,any,all
SRCFG[prime] PERIOD=INFINITY
SRCFG[prime] HOSTLIST=mylnxc1-n001
SRCFG[glide] CLASSLIST=glide,ghts,test,any,all
SRCFG[glide] PERIOD=INFINITY
SRCFG[glide] HOSTLIST=mylnxc1-n002
#SRCFG[ghts] CLASSLIST=ghts,test,any,all
#SRCFG[ghts] PERIOD=INFINITY
#SRCFG[ghts] HOSTLIST=mylnxc1-n00[1-2]
Works fine. I can submit to all queues, with prime jobs going only to node mylnxc1-n001, glide jobs going only to mylnxc1-n002, and all other jobs going to either node. However, with this layout every QOSLIST entry in an SRCFG applies to every CLASSLIST entry in that SRCFG. What I really want is to apply specific QOSLIST entries to specific CLASSLIST entries on specific nodes, using multiple SRCFGs as necessary.
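To make the goal concrete, here is a rough sketch of the kind of layout I am after. The QOS names "high" and "low" are hypothetical placeholders, and this is only the desired arrangement, not a configuration I have confirmed to work. Because the third reservation shares nodes with the first two, I would expect it to run into the same problem described above.

```
# Desired layout (sketch only; QOS names "high" and "low" are hypothetical)
SRCFG[prime] CLASSLIST=prime
SRCFG[prime] QOSLIST=high
SRCFG[prime] PERIOD=INFINITY
SRCFG[prime] HOSTLIST=mylnxc1-n001
SRCFG[glide] CLASSLIST=glide
SRCFG[glide] QOSLIST=high
SRCFG[glide] PERIOD=INFINITY
SRCFG[glide] HOSTLIST=mylnxc1-n002
# Shared reservation covering both nodes for the remaining queues
SRCFG[ghts] CLASSLIST=ghts,test,any,all
SRCFG[ghts] QOSLIST=low
SRCFG[ghts] PERIOD=INFINITY
SRCFG[ghts] HOSTLIST=mylnxc1-n00[1-2]
```

The point of splitting it this way is that each class/QOS pairing lives in its own SRCFG, rather than one SRCFG carrying every class and every QOS together.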
Is anyone doing this successfully? If so, I would appreciate any help you can provide.
Stewart
-----Original Message-----
From: mauiusers-bounces at supercluster.org [mailto:mauiusers-bounces at supercluster.org]On Behalf Of Stewart.Samuels at sanofi-aventis.com
Sent: Friday, February 10, 2006 5:20 PM
To: mauiusers at supercluster.org
Subject: [Mauiusers] Standing Reservation Problem
Mauiusers,
I seem to be having trouble understanding the behavior of Maui. We are running Maui on Torque. I have set up queues via Torque and three standing reservations via Maui to direct jobs to a small cluster containing one master node and two compute nodes. All nodes have a single CPU and 1 GB of RAM.
The intent of my test is to execute prime jobs on mylnxc1-n001 and glide jobs on mylnxc1-n002 at any time. Additionally, I would like to run ghts, test, any, and all jobs at any time on either node, mylnxc1-n001 or mylnxc1-n002. However, jobs submitted to the prime or glide queues get stuck in the queue and never execute. checkjob shows they are waiting for resources, but nothing is running on the system (see below). Jobs sent to the other queues execute properly. If I comment out the third standing reservation, the prime and glide jobs execute properly, but all other jobs now get stuck in their queues with the same message from checkjob. It would appear that Maui won't let me map multiple queues onto the same nodes. Is anyone else experiencing this behavior?
Is this a function of the node policy? I've tried a few different node policy options, with the same result for each. It doesn't seem to matter whether I change it or not. I also see the same problem with Maui 3.2.6p11 on Torque 1.2.0p1 and with Maui 3.2.6p14 on Torque 2.0.0p4.
I also have the Maui log level set to 9, but the log essentially confirms the same deferred message as checkjob. I haven't included it in this set of data because of its volume, but I can provide it if required.
Any help would be greatly appreciated.
Stewart Samuels
Infrastructure Evolution and Integration
Scientific and Medical Affairs
Sanofi-Aventis Pharmaceutical
1041 Route 202-206
Bridgewater, NJ 08807
Phone: (908) 231-4762
Fax: (908) 231-3488
email: Stewart.Samuels at Sanofi-Aventis.com
---------------------------------------------------------------------------------------------
[root at mylnxc1-a log]# qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue glide
#
create queue glide
set queue glide queue_type = Execution
set queue glide resources_max.nodect = 1
set queue glide enabled = True
set queue glide started = True
#
# Create and define queue prime
#
create queue prime
set queue prime queue_type = Execution
set queue prime resources_max.nodect = 1
set queue prime enabled = True
set queue prime started = True
#
# Create and define queue test
#
create queue test
set queue test queue_type = Execution
set queue test resources_max.nodect = 2
set queue test enabled = True
set queue test started = True
#
# Create and define queue ghts
#
create queue ghts
set queue ghts queue_type = Execution
set queue ghts resources_max.nodect = 2
set queue ghts enabled = True
set queue ghts started = True
#
# Create and define queue any
#
create queue any
set queue any queue_type = Execution
set queue any resources_max.nodect = 2
set queue any enabled = True
set queue any started = True
#
# Create and define queue all
#
create queue all
set queue all queue_type = Execution
set queue all resources_max.nodect = 2
set queue all enabled = True
set queue all started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = ghts
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = False
[root at mylnxc1-a log]#
------------------------------------------------------------------------------------------
[root at mylnxc1-a log]# My maui.cfg
QUEUETIMEWEIGHT 10
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
#NODEALLOCATIONPOLICY MINRESOURCE
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SHARED
CLASSCFG[glide] MAXPROC=1
CLASSCFG[prime] MAXPROC=1
CLASSCFG[test] MAXPROC=2
CLASSCFG[ghts] MAXPROC=2
CLASSCFG[all] MAXPROC=2
CLASSCFG[any] MAXPROC=2
CREDWEIGHT 1
CLASSWEIGHT 1
QOSWEIGHT 1
XFACTORWEIGHT 1
SRCFG[prime] CLASSLIST=prime
SRCFG[prime] PERIOD=INFINITY
SRCFG[prime] HOSTLIST=mylnxc1-n001
SRCFG[glide] CLASSLIST=glide
SRCFG[glide] PERIOD=INFINITY
SRCFG[glide] HOSTLIST=mylnxc1-n002
SRCFG[ghts] CLASSLIST=ghts,test,any,all
SRCFG[ghts] PERIOD=INFINITY
SRCFG[ghts] HOSTLIST=mylnxc1-n00[1-2]
[nm67109 at mylnxc1-a nm67109]$ checkjob 108
checking job 108
State: Idle EState: Deferred
Creds: user:nm67109 group:lgdgis class:prime qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Fri Feb 10 17:04:49
(Time Queued Total: 00:00:45 Eligible: 00:00:01)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: NoResources (cannot create reservation for job '108'
(intital reservation attempt)
)
Holds: Defer (hold reason: NoResources)
PE: 1.00 StartPriority: 1
cannot select job 108 for partition DEFAULT (job hold active)
[nm67109 at mylnxc1-a nm67109]$