[Mauiusers] Batch Hold for Policy Violation, which doesn't exist
Kelli Hendrickson
khendrk at MIT.EDU
Wed Jul 18 09:48:48 MDT 2007
OK, I just made a big newbie mistake; pardon the repost to correct it.
I finally got qmgr to list the settings for the queues. The setting
that Lennart suggested was not set, so I added it and restarted the
server. It still reports a policy violation of 128 > 70.
These are the current settings for the queue low:
Queue low
queue_type = Execution
Priority = 10
total_jobs = 2
state_count = Transit:0 Queued:2 Held:0 Waiting:0 Running:0 Exiting:0
max_running = 10
resources_max.ncpus = 70
resources_max.nodect = 140
resources_max.walltime = 96:00:00
mtime = Wed Jul 18 11:33:12 2007
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
enabled = True
started = True
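For anyone comparing several queues, the listing above can be reduced to just its limits. This is a minimal sketch in Python, assuming the listing has been pasted in as plain text. The function name parse_queue_attrs is made up for illustration and is not part of PBS or Maui.

```python
def parse_queue_attrs(listing):
    """Turn 'name = value' lines from a qmgr queue listing into a dict."""
    attrs = {}
    for line in listing.splitlines():
        if "=" not in line:
            continue  # skip the 'Queue low' header and any wrapped lines
        name, _, value = line.partition("=")
        attrs[name.strip()] = value.strip()
    return attrs


# Excerpt of the queue listing from this message.
listing = """Queue low
queue_type = Execution
max_running = 10
resources_max.ncpus = 70
resources_max.nodect = 140
resources_max.walltime = 96:00:00
"""

attrs = parse_queue_attrs(listing)
# Keep only the per-job upper limits.
limits = {k: v for k, v in attrs.items() if k.startswith("resources_max.")}
print(limits)
```

The values stay as strings here, since walltime is not a plain number.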
This is the information from PBS about one of the jobs waiting because
of the policy violation:
Resource_List.ncpus = 1
Resource_List.nodect = 32
Resource_List.nodes = 32:ppn=4
What is the difference between .ncpus and .nodect? And which one does
the maui scheduler look at?
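The arithmetic behind the numbers can be checked with a short Python sketch. It assumes that the processor count of a request is nodes times ppn, and it simply compares that count against each queue limit listed above. Which limit Maui actually enforces is the open question, and nothing here settles it. The name requested_procs and the limits dict are illustrative only.

```python
def requested_procs(nodes_spec):
    """Return nodes * ppn for a spec such as '32:ppn=4' (ppn defaults to 1)."""
    parts = nodes_spec.split(":")
    nodes = int(parts[0])
    ppn = 1
    for part in parts[1:]:
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])
    return nodes * ppn


# Values copied from the queue and job listings in this message.
limits = {"resources_max.ncpus": 70, "resources_max.nodect": 140}
procs = requested_procs("32:ppn=4")

# Assumption: each limit is compared against the total processor count.
exceeded = [name for name, limit in limits.items() if procs > limit]
print(procs, exceeded)
```

With these values, 128 is above the ncpus limit of 70 but below the nodect limit of 140, which matches the numbers in the "procs too high (128 > 70)" message. That the message comes from this limit is a guess, not something verified.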
Thanks again to anyone who can help,
Kelli
-------------------------------------------------
Dr. K. Hendrickson
MIT Research Engineer, Vortical Flow Research Lab
khendrk at mit.edu | 617-258-7675 | 5-326B
Lennart Karlsson wrote:
>I would try to add a resources_max.nodect declaration in qmgr for
>each PBS queue, as for example:
>
>set queue short resources_max.nodect = 140
>
>This sets the upper limit on how many processors/cores you may
>use in a single job.
>
>Regarding the count of 70, I do not know why you get it. Perhaps
>it is due to a double-counting bug of Maui (see bug number 99 in
>http://clusterresources.com/bugzilla/), but I am not sure if it
>appears already in p13 of Maui-3.2.6. The bug appears sometime
>after p11 and at latest in snapshots of p14. I run a snap version
>of p16 (or p11) on our Maui systems and both are free from that
>bug. Perhaps you should upgrade to some snap version of p16 or
>later (I run maui-3.2.6p16-snap.1157560841 here)?
>
>Cheers,
>-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
> National Supercomputer Centre in Linkoping, Sweden
> http://www.nsc.liu.se
> +46 706 49 55 35
> +46 13 28 26 24
>
>Kelli Hendrickson wrote:
>
>
>>I've browsed the web and haven't been able to find a solution to this
>>problem and the technical support for my cluster has somewhat left me in
>>the wind on this issue.
>>
>>We've got Maui version 3.2.6p13, configured out of the box by my cluster
>>vendor running on SuSE 10.1 with pbs_server 2.1.3.
>>
>>The system has 35 dual-CPU, dual-core compute nodes. When the
>>system is open and we submit a job using qsub -l nodes=32:ppn=4 (i.e.
>>looking for 128 procs), the job starts immediately and runs.
>>
>>However, if this job ever has to wait, the Maui scheduler puts it in
>>Batch hold. A checkjob reports that there is a Policy Violation, the
>>number of procs is too high (128 > 70) - output below.
>>
>>The job will run if you use the "runjob" command but not if you do a
>>"releasehold" command.
>>
>>The diagnose -n command reports that there are 35 nodes, and 140 procs.
>>
>>The nodeallocationpolicy is set to minresource.
>>
>>qmgr also reports the correct number of nodes/procs (output below)
>>
>>So my question is this... where is maui getting this 70 from?
>>Obviously, the procs are available because the job runs with the use of
>>the runjob command. Is there another setting that was missed to make
>>this all work correctly? My vendor essentially suggested I change to
>>using the pbs scheduler instead but that seems like giving up on
>>something which seems like a simple matter.
>>
>>Any help would be greatly appreciated.
>>Thanks,
>>Kelli
>>
>>--------------------------------- checkjob output.
>>checking job 1626
>>
>>State: Idle
>>Creds: user:kelli group:users class:low qos:DEFAULT
>>WallTime: 00:00:00 of 4:00:00:00
>>SubmitTime: Tue Jul 17 11:53:55
>> (Time Queued Total: 00:13:16 Eligible: 00:00:01)
>>
>>Total Tasks: 128
>>
>>Req[0] TaskCount: 128 Partition: ALL
>>Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>>Opsys: [NONE] Arch: [NONE] Features: [NONE]
>>
>>
>>IWD: [NONE] Executable: [NONE]
>>Bypass: 0 StartCount: 0
>>PartitionMask: [ALL]
>>Holds: Batch (hold reason: PolicyViolation)
>>Messages: procs too high (128 > 70)
>>PE: 128.00 StartPriority: 1
>>cannot select job 1626 for partition DEFAULT (job hold active)
>>---------------------------------------------------- qmgr output
>> server_state = Active
>> scheduling = True
>> total_jobs = 1
>> state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
>> default_queue = short
>> log_events = 511
>> mail_from = adm
>> query_other_jobs = True
>> resources_available.ncpus = 140
>> resources_available.nodect = 35
>> resources_default.ncpus = 1
>> resources_default.nodect = 1
>> resources_assigned.ncpus = 0
>> resources_assigned.nodect = 0
>> scheduler_iteration = 120
>> node_check_rate = 150
>> tcp_timeout = 6
>> pbs_version = 2.1.3
>>--------------------------------------------------- end
>>
>>--
>>-------------------------------------------------
>>Dr. K. Hendrickson
>>MIT Research Engineer, Vortical Flow Research Lab
>>khendrk at mit.edu | 617-258-7675 | 5-326B
>>
>>