[Mauiusers] Batch Hold for Policy Violation, which doesn't exist

Kelli Hendrickson khendrk at MIT.EDU
Tue Jul 17 10:13:11 MDT 2007


Hello,
I've searched the web without finding a solution to this problem, and 
the technical support for my cluster has left me on my own with this 
issue.

We're running Maui version 3.2.6p13, configured out of the box by my 
cluster vendor, on SuSE 10.1 with pbs_server 2.1.3.

The system has 35 dual-CPU, dual-core compute nodes (4 procs per node).  
When the system is free and we submit a job using qsub -l nodes=32:ppn=4 
(i.e. asking for 128 procs), the job starts immediately and runs.

However, if this job ever has to wait, the Maui scheduler puts it in 
Batch hold.  checkjob reports a PolicyViolation: the number of procs is 
too high (128 > 70).  The output is below.

The job will run if you use the "runjob" command but not if you do a 
"releasehold" command.

The diagnose -n command reports 35 nodes and 140 procs.

The nodeallocationpolicy is set to minresource.

qmgr also reports the correct number of nodes/procs (output below).

So my question is this: where is Maui getting this 70 from?  
Obviously, the procs are available, because the job runs when started 
with the runjob command.  Is there another setting that was missed to 
make this all work correctly?  My vendor essentially suggested switching 
to the PBS scheduler instead, but that feels like giving up on what 
should be a simple matter.
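
To spell out the arithmetic behind these numbers, here is a minimal 
Python sketch.  The variable names are made up for illustration and are 
not Maui or PBS parameters.  The last check is only an observation, not 
a diagnosis: 70 happens to equal 35 nodes x 2, which is what you would 
get by counting two procs per node instead of four.

```python
# Hypothetical names for illustration only; not Maui or PBS settings.
NODES = 35
CPUS_PER_NODE = 2    # "dual cpu"
CORES_PER_CPU = 2    # "dual core"

procs_per_node = CPUS_PER_NODE * CORES_PER_CPU
total_procs = NODES * procs_per_node   # what diagnose -n and qmgr report
requested_procs = 32 * 4               # qsub -l nodes=32:ppn=4
reported_limit = 70                    # from "procs too high (128 > 70)"

print("total procs:", total_procs)
print("requested procs:", requested_procs)
print("request fits on the cluster:", requested_procs <= total_procs)
# Observation only (assumption): 70 matches two procs per node.
print("limit equals nodes x 2:", reported_limit == NODES * CPUS_PER_NODE)
```

So the request fits within the procs the cluster actually has, and the 
limit in the hold message is exactly half of that total.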

Any help would be greatly appreciated.
Thanks,
Kelli

--------------------------------- checkjob output.
checking job 1626

State: Idle
Creds:  user:kelli  group:users  class:low  qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Tue Jul 17 11:53:55
  (Time Queued  Total: 00:13:16  Eligible: 00:00:01)

Total Tasks: 128

Req[0]  TaskCount: 128  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Holds:    Batch  (hold reason:  PolicyViolation)
Messages:  procs too high (128 > 70)
PE:  128.00  StartPriority:  1
cannot select job 1626 for partition DEFAULT (job hold active)
---------------------------------------------------- qmgr output
        server_state = Active
        scheduling = True
        total_jobs = 1
        state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
        default_queue = short
        log_events = 511
        mail_from = adm
        query_other_jobs = True
        resources_available.ncpus = 140
        resources_available.nodect = 35
        resources_default.ncpus = 1
        resources_default.nodect = 1
        resources_assigned.ncpus = 0
        resources_assigned.nodect = 0
        scheduler_iteration = 120
        node_check_rate = 150
        tcp_timeout = 6
        pbs_version = 2.1.3
--------------------------------------------------- end

-- 
-------------------------------------------------
Dr. K. Hendrickson
MIT Research Engineer, Vortical Flow Research Lab
khendrk at mit.edu | 617-258-7675 | 5-326B


