[Mauiusers] Batch Hold for Policy Violation, which doesn't exist

Kelli Hendrickson khendrk at MIT.EDU
Wed Jul 18 09:14:24 MDT 2007


Lennart,
Thank you for the input.  I tried the following, with no luck:

1) As the queues were set up in Maui by the vendor, I set
resources_max.nodect at the server level with the qmgr command
set server resources_max.nodect = 140, restarted PBS on the master,
and repeated the test.  I get the same output from checkjob and
diagnose (diagnose output included below).
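
For reference, here is roughly the sequence I ran on the master (the
restart line is approximate; substitute however pbs_server is normally
restarted on your system):

   qmgr -c "set server resources_max.nodect = 140"
   qmgr -c "print server" | grep resources_max    # confirm the setting took
   /etc/init.d/pbs_server restart                 # restart PBS on the master
   checkjob 1637                                  # still shows the Batch hold
   diagnose -j                                    # still reports job 1637 blocked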

2) I dug into the Bugzilla report on bug 99 as you suggested.  I'm not
quite sure that this is the exact problem I'm experiencing, since
diagnose reports only that the job has been put on batch hold, not
that it violated a maxproc limit.
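
For what it's worth, the proc counts I'm working from come straight
from Maui and PBS, e.g.:

   diagnose -n                   # Maui's view: 35 nodes / 140 procs
   pbsnodes -a | grep "np = "    # PBS's view: np = 4 on each of the 35 nodes

and neither of those shows a 70 anywhere.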

I'm including the maui.cfg as well, in case it provides some insight
to anyone.

Thanks,
Kelli

------------------------------------------------ diagnose output
Diagnosing blocked jobs (policylevel SOFT  partition ALL)

job 1637                 has the following hold(s) in place:  Batch
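
(For reference, the blocked job above was submitted the same way as in
my first mail, essentially

   qsub -l nodes=32:ppn=4 run_job.sh

where run_job.sh is just a stand-in for the actual submission script.)
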
------------------------------------------------ maui.cfg
# maui.cfg 3.2.6p13

SERVERHOST            master
# primary admin must be first in list
ADMIN1                root

# Resource Manager Definition

RMCFG[MASTER] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

# full parameter docs at http://clusterresources.com/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://clusterresources.com/mauidocs/a.esecurity.html

LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://clusterresources.com/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://clusterresources.com/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://clusterresources.com/mauidocs/6.2throttlingpolicies.html
# NONE SPECIFIED

# Backfill: http://clusterresources.com/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://clusterresources.com/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE

# QOS: http://clusterresources.com/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://clusterresources.com/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://clusterresources.com/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

SRCFG[short]   CLASSLIST=short  PRIORITY=100 PERIOD=INFINITY
SRCFG[low]     CLASSLIST=low    PRIORITY=10  PERIOD=INFINITY
SRCFG[medium]  CLASSLIST=medium PRIORITY=40  PERIOD=INFINITY
SRCFG[high]    CLASSLIST=high   PRIORITY=70  PERIOD=INFINITY

SRCFG[short]   CLASSLIST=short  PRIORITY=100 PERIOD=INFINITY
SRCFG[low]     CLASSLIST=low    PRIORITY=10  PERIOD=INFINITY
SRCFG[medium]  CLASSLIST=medium PRIORITY=40  PERIOD=INFINITY
SRCFG[high]    CLASSLIST=high   PRIORITY=70  PERIOD=INFINITY

SRCFG[short]   CLASSLIST=short  PRIORITY=100 PERIOD=INFINITY
SRCFG[low]     CLASSLIST=low    PRIORITY=10  PERIOD=INFINITY
SRCFG[medium]  CLASSLIST=medium PRIORITY=40  PERIOD=INFINITY
SRCFG[high]    CLASSLIST=high   PRIORITY=70  PERIOD=INFINITY
----------------------------------------------------------- end maui.cfg

-------------------------------------------------
Dr. K. Hendrickson
MIT Research Engineer, Vortical Flow Research Lab
khendrk at mit.edu | 617-258-7675 | 5-326B



Lennart Karlsson wrote:

>I would try to add a resources_max.nodect declaration in qmgr for
>each PBS queue, as for example:
>
>set queue short resources_max.nodect = 140
>
>This sets the upper limit on how many processors/cores you may
>use in a single job.
>
>Regarding the count of 70, I do not know why you get it. Perhaps
>it is due to a double-counting bug in Maui (see bug number 99 in
>http://clusterresources.com/bugzilla/), but I am not sure whether it
>already appears in p13 of Maui-3.2.6. The bug appears sometime after
>p11 and at the latest in the snapshots of p14. I run a snapshot of
>p16 (or p11) on our Maui systems, and both are free from that bug.
>Perhaps you should upgrade to some snapshot of p16 or later (I run
>maui-3.2.6p16-snap.1157560841 here)?
>
>Cheers,
>-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
>   National Supercomputer Centre in Linkoping, Sweden
>   http://www.nsc.liu.se
>   +46 706 49 55 35
>   +46 13 28 26 24
>
>Kelli Hendrickson wrote:
>  
>
>>I've browsed the web and haven't been able to find a solution to this 
>>problem, and the technical support for my cluster has somewhat left me 
>>in the wind on this issue.
>>
>>We've got Maui version 3.2.6p13, configured out of the box by my cluster 
>>vendor, running on SuSE 10.1 with pbs_server 2.1.3.
>>
>>The system has 35 dual-CPU, dual-core compute nodes.  When the 
>>system is free and we submit a job using qsub -l nodes=32:ppn=4 (i.e. 
>>looking for 128 procs), the job will start immediately and run.
>>
>>However, if this job ever has to wait, the Maui scheduler puts it in 
>>Batch hold.  A checkjob reports a Policy Violation: the number of 
>>procs is too high (128 > 70) - output below.
>>
>>The job will run if started with the "runjob" command, but not after 
>>a "releasehold" command.
>>
>>The diagnose -n command reports that there are 35 nodes and 140 procs.
>>
>>The nodeallocationpolicy is set to minresource.
>>
>>qmgr also reports the correct number of nodes/procs (output below).
>>
>>So my question is this: where is Maui getting this 70 from?  
>>Obviously the procs are available, because the job runs when started 
>>with the runjob command.  Is there another setting that was missed to 
>>make this all work correctly?  My vendor essentially suggested I 
>>switch to the PBS scheduler instead, but that seems like giving up on 
>>what looks like a simple matter.
>>
>>Any help would be greatly appreciated.
>>Thanks,
>>Kelli
>>
>>--------------------------------- checkjob output.
>>checking job 1626
>>
>>State: Idle
>>Creds:  user:kelli  group:users  class:low  qos:DEFAULT
>>WallTime: 00:00:00 of 4:00:00:00
>>SubmitTime: Tue Jul 17 11:53:55
>>  (Time Queued  Total: 00:13:16  Eligible: 00:00:01)
>>
>>Total Tasks: 128
>>
>>Req[0]  TaskCount: 128  Partition: ALL
>>Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>
>>
>>IWD: [NONE]  Executable:  [NONE]
>>Bypass: 0  StartCount: 0
>>PartitionMask: [ALL]
>>Holds:    Batch  (hold reason:  PolicyViolation)
>>Messages:  procs too high (128 > 70)
>>PE:  128.00  StartPriority:  1
>>cannot select job 1626 for partition DEFAULT (job hold active)
>>---------------------------------------------------- qmgr output
>>        server_state = Active
>>        scheduling = True
>>        total_jobs = 1
>>        state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
>>        default_queue = short
>>        log_events = 511
>>        mail_from = adm
>>        query_other_jobs = True
>>        resources_available.ncpus = 140
>>        resources_available.nodect = 35
>>        resources_default.ncpus = 1
>>        resources_default.nodect = 1
>>        resources_assigned.ncpus = 0
>>        resources_assigned.nodect = 0
>>        scheduler_iteration = 120
>>        node_check_rate = 150
>>        tcp_timeout = 6
>>        pbs_version = 2.1.3
>>--------------------------------------------------- end
>>
>>-- 
>>-------------------------------------------------
>>Dr. K. Hendrickson
>>MIT Research Engineer, Vortical Flow Research Lab
>>khendrk at mit.edu | 617-258-7675 | 5-326B
>>    
>>
>
>
>_______________________________________________
>mauiusers mailing list
>mauiusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/mauiusers
>  
>

