[Mauiusers] Batch Hold for Policy Violation, which doesn't exist

Kelli Hendrickson khendrk at MIT.EDU
Wed Jul 18 09:48:48 MDT 2007


OK, I just made a big newbie mistake; pardon my repost to correct it.

I finally got qmgr to list the settings for the queues.  The setting 
that Lennart suggested was not set, so I added it and restarted the 
server.  It still reports a policy violation of 128 > 70.
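
For reference, I added it with the qmgr command Lennart suggested and 
then restarted pbs_server, roughly like this (the restart procedure 
may differ on other installs):

    qmgr -c "set queue low resources_max.nodect = 140"
    qterm -t quick        # stop pbs_server
    pbs_server            # start it again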

This is the current setting for the queue low:
Queue low
        queue_type = Execution
        Priority = 10
        total_jobs = 2
        state_count = Transit:0 Queued:2 Held:0 Waiting:0 Running:0 Exiting:0
        max_running = 10
        resources_max.ncpus = 70
        resources_max.nodect = 140
        resources_max.walltime = 96:00:00
        mtime = Wed Jul 18 11:33:12 2007
        resources_assigned.ncpus = 0
        resources_assigned.nodect = 0
        enabled = True
        started = True

This is the information from PBS about one of the jobs waiting because 
of the policy violation:
    Resource_List.ncpus = 1
    Resource_List.nodect = 32
    Resource_List.nodes = 32:ppn=4

What is the difference between .ncpus and .nodect?  And which one does 
the Maui scheduler look at?

Thanks again to anyone who can help,
Kelli

-------------------------------------------------
Dr. K. Hendrickson
MIT Research Engineer, Vortical Flow Research Lab
khendrk at mit.edu | 617-258-7675 | 5-326B



Lennart Karlsson wrote:

>I would try adding a resources_max.nodect declaration in qmgr for
>each PBS queue, for example:
>
>set queue short resources_max.nodect = 140
>
>This sets the upper limit on how many processors/cores you may
>use in a single job.
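>
>You can check that the limit took effect with, for example:
>
>    qmgr -c "print queue short"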
>
>Regarding the count of 70, I do not know why you get it. Perhaps it
>is due to a double-counting bug in Maui (see bug number 99 in
>http://clusterresources.com/bugzilla/), but I am not sure whether it
>is already present in p13 of Maui-3.2.6. The bug appears sometime
>after p11 and at latest in the snapshots of p14. I run snap versions
>of p16 (or p11) on our Maui systems and both are free of that bug.
>Perhaps you should upgrade to some snap version of p16 or later (I
>run maui-3.2.6p16-snap.1157560841 here)?
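>
>Building a Maui snapshot is the usual configure/make procedure,
>roughly like this (the --with-pbs path below is just an example;
>point it at wherever your Torque install lives):
>
>    ./configure --with-pbs=/usr/local
>    make
>    make install
>    # then restart the maui daemon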
>
>Cheers,
>-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
>   National Supercomputer Centre in Linkoping, Sweden
>   http://www.nsc.liu.se
>   +46 706 49 55 35
>   +46 13 28 26 24
>
>Kelli Hendrickson wrote:
>  
>
>>I've browsed the web and haven't been able to find a solution to this 
>>problem, and the technical support for my cluster has more or less 
>>left me hanging on this issue.
>>
>>We've got Maui version 3.2.6p13, configured out of the box by my cluster 
>>vendor, running on SuSE 10.1 with pbs_server 2.1.3.
>>
>>The system has 35 dual-CPU, dual-core compute nodes.  When the system 
>>is idle and we submit a job using qsub -l nodes=32:ppn=4 (i.e. asking 
>>for 128 procs), the job starts immediately and runs.
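>>
>>(That is 32 x 4 = 128 procs out of the 35 x 4 = 140 in the machine, 
>>so the job fits when nothing else is running.)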
>>
>>However, if this job ever has to wait, the Maui scheduler puts it in 
>>Batch hold.  A checkjob reports a policy violation: the number of 
>>procs is too high (128 > 70) - output below.
>>
>>The job will run if you use the "runjob" command, but not if you just 
>>do a "releasehold".
>>
>>The diagnose -n command reports that there are 35 nodes and 140 procs.
>>
>>The nodeallocationpolicy is set to minresource.
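>>
>>i.e. maui.cfg contains:
>>
>>    NODEALLOCATIONPOLICY MINRESOURCE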
>>
>>qmgr also reports the correct number of nodes/procs (output below).
>>
>>So my question is this: where is Maui getting this 70 from?  
>>Obviously the procs are available, because the job runs via the 
>>runjob command.  Is there another setting that was missed to make 
>>this all work correctly?  My vendor essentially suggested I switch 
>>to the PBS scheduler instead, but that seems like giving up on what 
>>looks like a simple matter.
>>
>>Any help would be greatly appreciated.
>>Thanks,
>>Kelli
>>
>>--------------------------------- checkjob output.
>>checking job 1626
>>
>>State: Idle
>>Creds:  user:kelli  group:users  class:low  qos:DEFAULT
>>WallTime: 00:00:00 of 4:00:00:00
>>SubmitTime: Tue Jul 17 11:53:55
>>  (Time Queued  Total: 00:13:16  Eligible: 00:00:01)
>>
>>Total Tasks: 128
>>
>>Req[0]  TaskCount: 128  Partition: ALL
>>Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>
>>
>>IWD: [NONE]  Executable:  [NONE]
>>Bypass: 0  StartCount: 0
>>PartitionMask: [ALL]
>>Holds:    Batch  (hold reason:  PolicyViolation)
>>Messages:  procs too high (128 > 70)
>>PE:  128.00  StartPriority:  1
>>cannot select job 1626 for partition DEFAULT (job hold active)
>>---------------------------------------------------- qmgr output
>>        server_state = Active
>>        scheduling = True
>>        total_jobs = 1
>>        state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
>>        default_queue = short
>>        log_events = 511
>>        mail_from = adm
>>        query_other_jobs = True
>>        resources_available.ncpus = 140
>>        resources_available.nodect = 35
>>        resources_default.ncpus = 1
>>        resources_default.nodect = 1
>>        resources_assigned.ncpus = 0
>>        resources_assigned.nodect = 0
>>        scheduler_iteration = 120
>>        node_check_rate = 150
>>        tcp_timeout = 6
>>        pbs_version = 2.1.3
>>--------------------------------------------------- end
>>
>>-- 
>>-------------------------------------------------
>>Dr. K. Hendrickson
>>MIT Research Engineer, Vortical Flow Research Lab
>>khendrk at mit.edu | 617-258-7675 | 5-326B
>>    
>>
>
>

