[Mauiusers] Simple Torque+Maui setup: jobs stay queued, no resources

Sebastiaan Breedveld s.breedveld at erasmusmc.nl
Thu Apr 5 05:32:40 MDT 2012


Dear list,

I am trying to set up a very basic Torque+Maui system. I have been running a 
Torque cluster for a year now and wanted to improve the scheduling with 
Maui. To this end, I installed a fresh test system, with server and node 
on a single computer.

Torque version: 2.4.16
Maui version: 3.3.1
uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux



I was able to run (simple) jobs with the Torque scheduler. Since I 
replaced the scheduler with Maui, jobs stay queued. Jobs are submitted with:

$ qsub -q batch test-script.sh

where test-script.sh is nothing more than a 'sleep 1m' script. Checking 
the job:

# checkjob -v 55
checking job 55 (RM job '55.testing.azr.nl')

State: Idle  EState: Deferred
Creds:  user:sebastiaan  group:sebastiaan  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Thu Apr  5 13:21:33
   (Time Queued  Total: 00:00:32  Eligible: 00:00:01)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 15G
Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1  MEM: 2000M  SWAP: 15G
NodeAccess: SHARED
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  NoResources  (cannot create reservation for 
job '55' (intital reservation attempt)
)
Holds:    Defer  (hold reason:  NoResources)
PE:  16.03  StartPriority:  1
cannot select job 55 for partition DEFAULT (job hold active)



shows that there are no resources available, even though the node is free 
and unloaded:

# checknode testing


checking node testing.azr.nl

State:      Idle  (in current state for 2:23:54)
Configured Resources: PROCS: 2  MEM: 984M  SWAP: 1996M  DISK: 1M
Utilized   Resources: SWAP: 149M
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.050
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [batch 2:2]

Total Time: 16:11:49  Up: 16:11:49 (100.00%)  Active: 00:01:00 (0.10%)

Reservations:
NOTE:  no reservations on node



When the job is added, maui.log shows this:
04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0)
04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate)
04/05 13:21:34 INFO:     processing node request line '1'
04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,)
04/05 13:21:34 INFO:     default QOS for job 55 set to DEFAULT(0) 
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO:     default QOS for job 55 set to DEFAULT(0) 
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO:     default QOS for job 55 set to DEFAULT(0) 
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
04/05 13:21:34 INFO:     job '55' loaded:   1 sebastiaan sebastiaan  
21600       Idle   0 1333624893   [NONE] [NONE] [NONE] >=      0 >=      
0 [1][ppn=1] 1333624894
04/05 13:21:34 INFO:     12 PBS jobs detected on RM TESTING
04/05 13:21:34 INFO:     jobs detected: 12
04/05 13:21:34 MStatClearUsage(node,Active)
04/05 13:21:34 MClusterUpdateNodeState()
04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
04/05 13:21:34 INFO:     job '40' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '41' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '42' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '44' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '45' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '47' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '48' Priority:       16
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     16(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '49' Priority:       12
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     12(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '52' Priority:        8
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:      8(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '53' Priority:        1
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '54' Priority:       60
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     60(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '55' Priority:        1
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 MStatClearUsage([NONE],Active)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 INFO:     total jobs selected (ALL): 1/12 [EState: 11]
04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
04/05 13:21:34 INFO:     job '40' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '41' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '42' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '44' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '45' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '47' Priority:       22
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     22(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '48' Priority:       16
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     16(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '49' Priority:       12
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     12(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '52' Priority:        8
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:      8(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '53' Priority:        1
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '54' Priority:       60
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:     60(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 INFO:     job '55' Priority:        1
04/05 13:21:34 INFO:     Cred:      0(00.0)  FS:      0(00.0)  
Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      
0(00.0)  Us:      0(00.0)
04/05 13:21:34 MStatClearUsage([NONE],Idle)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 MResDestroy(NULL)
04/05 13:21:34 INFO:     total jobs selected (ALL): 1/12 [EState: 11]
04/05 13:21:34 
MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
04/05 13:21:34 INFO:     total jobs selected in partition ALL: 1/1
04/05 13:21:34 MQueueScheduleRJobs(Q)
04/05 13:21:34 
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO:     total jobs selected in partition ALL: 1/1
04/05 13:21:34 
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO:     total jobs selected in partition DEFAULT: 1/1
04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT)
04/05 13:21:34 INFO:     0 feasible tasks found for job 55:0 in 
partition DEFAULT (1 Needed)
04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej)
04/05 13:21:34 MJobReserve(55,Priority)
04/05 13:21:34 ALERT:    job 55 cannot run in any partition
04/05 13:21:34 ALERT:    cannot create new reservation for job 55 
(shape[1] 1)
04/05 13:21:34 ALERT:    cannot create new reservation for job 55
04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create 
reservation for job '55' (intital reservation attempt)
)
04/05 13:21:34 ALERT:    job '55' cannot run (deferring job for 3600 
seconds)
04/05 13:21:34 WARNING:  cannot reserve priority job '55'
Active Jobs------
------------------
04/05 13:21:34 INFO:     resources available after scheduling: N: 1  P: 2
04/05 13:21:34 
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
04/05 13:21:34 INFO:     total jobs selected in partition DEFAULT: 0/1 
[EState: 1]
04/05 13:21:34 
MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO:     total jobs selected in partition ALL: 0/1 
[EState: 1]
04/05 13:21:34 
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
04/05 13:21:34 INFO:     total jobs selected in partition ALL: 0/1 
[EState: 1]
04/05 13:21:34 MSchedUpdateStats()
04/05 13:21:34 INFO:     iteration:  288   scheduling time:  0.002 seconds
04/05 13:21:34 MResUpdateStats()
04/05 13:21:34 INFO:     current util[288]:  0/1 (0.00%)  PH: 0.00%  
active jobs: 0 of 2 (completed: 1)
04/05 13:21:34 MQueueCheckStatus()
04/05 13:21:34 MNodeCheckStatus()
04/05 13:21:34 MUClearChild(PID)
04/05 13:21:34 INFO:     scheduling complete.  sleeping 30 seconds
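
Most of this log is routine output, and many records were wrapped by the 
mail client. As a rough aid for reading it, the sketch below re-joins 
wrapped records and keeps only the ALERT, WARNING and "feasible tasks" 
lines. The function names are made up for illustration, and it assumes 
every record starts with a "MM/DD HH:MM:SS" timestamp, as in the excerpt 
above.

```python
import re

# Start of a Maui log record as seen above, e.g. "04/05 13:21:34 ".
STAMP = re.compile(r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2} ?")

def join_wrapped(lines):
    """Re-join records that were wrapped across several lines.

    Any line without a leading timestamp is treated as a continuation
    of the previous record (an assumption that fits the excerpt above).
    """
    records = []
    for line in lines:
        text = line.rstrip("\n")
        if STAMP.match(text) or not records:
            records.append(text)
        else:
            records[-1] = records[-1].rstrip() + " " + text.strip()
    return records

def interesting(records, keywords=("ALERT:", "WARNING:", "feasible tasks")):
    """Keep only the records that contain one of the keywords."""
    return [r for r in records if any(k in r for k in keywords)]

# A few records copied from the log above, including a wrapped one.
SAMPLE = """\
04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT)
04/05 13:21:34 INFO:     0 feasible tasks found for job 55:0 in
partition DEFAULT (1 Needed)
04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej)
04/05 13:21:34 ALERT:    job 55 cannot run in any partition
""".splitlines()

for record in interesting(join_wrapped(SAMPLE)):
    print(record)
```

To run it against the real file, pass open("maui.log") to join_wrapped() 
instead of SAMPLE.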


I think the relevant line is:
04/05 13:21:34 INFO:     0 feasible tasks found for job 55:0 in 
partition DEFAULT (1 Needed)

but I have no idea how to make a task feasible for the job. I have tried 
submitting with -l nodes=1:ppn=1, -l walltime=2:00:00, etc., but none of 
these seems to have had any effect.
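
One way to narrow this down is to put the per-task request reported by 
checkjob next to the configured resources reported by checknode. The 
sketch below does only that comparison. The rule it applies (a task fits 
only if every dedicated resource is no larger than the node's configured 
amount) is an assumption about how feasibility is decided, not something 
the log states. The function names are hypothetical, and "G" is assumed 
to mean 1024 M.

```python
import re

# Everything is normalised to megabytes; PROCS has no unit and stays a count.
UNITS = {"": 1, "M": 1, "G": 1024}  # assumption: 1G = 1024M

def parse_resources(text):
    """Parse e.g. 'PROCS: 1  MEM: 2000M  SWAP: 15G' into a dict of integers."""
    found = {}
    for name, amount, unit in re.findall(r"(\w+):\s*(\d+)([MG]?)", text):
        found[name] = int(amount) * UNITS[unit]
    return found

def shortfalls(requested, configured):
    """Return {resource: (requested, configured)} where the request is larger."""
    return {
        name: (need, configured.get(name, 0))
        for name, need in requested.items()
        if need > configured.get(name, 0)
    }

# Copied from the checkjob and checknode output above.
JOB = parse_resources("PROCS: 1  MEM: 2000M  SWAP: 15G")
NODE = parse_resources("PROCS: 2  MEM: 984M  SWAP: 1996M  DISK: 1M")

for name, (need, have) in shortfalls(JOB, NODE).items():
    print(f"{name}: requested {need}, node has {have}")
```

With the numbers above, MEM and SWAP are both flagged. The MEM figure 
matches resources_default.mem (2000mb) on the batch queue. Where the SWAP 
figure comes from is less obvious, though resources_default.pvmem is a 
plausible candidate.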



Torque configuration. I have tried setting various attributes on the 
queue, hoping that they would have some effect:
# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch Priority = 20
set queue batch max_running = 8
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodect = 10
set queue batch resources_max.nodes = 2
set queue batch resources_min.ncpus = 0
set queue batch resources_default.mem = 2000mb
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.pvmem = 16000mb
set queue batch resources_default.walltime = 06:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = testing.azr.nl
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 10
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 56
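
To see at a glance which defaults the queue applies to a job that does 
not request them explicitly, the dump can be filtered for its 
resources_default entries. The following is only a sketch: the function 
name queue_defaults is made up for illustration, and it assumes the dump 
has the "set queue <name> <attribute> = <value>" layout shown above.

```python
import re

# Matches lines such as "set queue batch resources_default.mem = 2000mb".
SET_QUEUE = re.compile(r"^set queue (\S+) (\S+) = (.+)$")

def queue_defaults(qmgr_dump, queue):
    """Return the resources_default.* attributes of one queue as a dict."""
    defaults = {}
    for line in qmgr_dump.splitlines():
        match = SET_QUEUE.match(line.strip())
        if not match or match.group(1) != queue:
            continue
        attribute, value = match.group(2), match.group(3).strip()
        if attribute.startswith("resources_default."):
            # Drop the "resources_default." prefix from the key.
            defaults[attribute.split(".", 1)[1]] = value
    return defaults

# A few lines copied from the dump above.
DUMP = """\
create queue batch
set queue batch queue_type = Execution
set queue batch resources_max.nodes = 2
set queue batch resources_default.mem = 2000mb
set queue batch resources_default.nodes = 1
set queue batch resources_default.pvmem = 16000mb
set queue batch resources_default.walltime = 06:00:00
set server scheduling = True
"""

for name, value in queue_defaults(DUMP, "batch").items():
    print(f"{name:10s} {value}")
```

The resulting values can then be compared by eye with the "Dedicated 
Resources Per Task" line that checkjob prints for a job.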


Maui configuration, untouched:
# maui.cfg 3.3.1

SERVERHOST            testing
# primary admin must be first in list
ADMIN1                root

# Resource Manager Definition

RMCFG[TESTING] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html


LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR



Any ideas?

Thanks in advance,
Sebastiaan


-- 
Sebastiaan Breedveld, MSc.
Ph.D. student

Erasmus MC - Daniel den Hoed Cancer Center
Department of Radiation Oncology

Groene Hilledijk 301
3075 EA Rotterdam
The Netherlands

Phone: +31 10 7042693
Room: Gs-20
