Thu Aug 11 01:10:38 MDT 2011

I am setting up a cluster environment using Torque 2.3.7 and Maui 3.2.6. These are older versions; however, they are part of the cluster management software package. On the cluster I have set up 2 queues as below:

# Create and define queue standard
create queue standard
set queue standard queue_type = Execution
set queue standard Priority = 15
set queue standard max_queuable = 60
set queue standard max_running = 40
set queue standard resources_max.walltime = 72:00:00
set queue standard resources_default.neednodes = standard
set queue standard resources_default.walltime = 24:00:00
set queue standard max_user_run = 10
set queue standard enabled = True
set queue standard started = True
# Create and define queue habeus
create queue habeus
set queue habeus queue_type = Execution
set queue habeus Priority = 16
set queue habeus max_queuable = 16
set queue habeus max_running = 8
set queue habeus resources_max.walltime = 72:00:00
set queue habeus resources_default.neednodes = habeus
set queue habeus resources_default.walltime = 24:00:00
set queue habeus max_user_run = 4
set queue habeus enabled = True
set queue habeus started = True
# Set server attributes.
set server scheduling = True
set server acl_hosts = pbs_server
set server acl_hosts += pbs_oscar
set server acl_hosts += hpc
set server acl_hosts += hpc00
set server managers = root@hpc
set server managers += root@hpc00
set server operators = root@hpc
set server operators += root@hpc00
set server default_queue = standard
set server log_events = 64
set server mail_from = hpcadmin
set server query_other_jobs = True
set server scheduler_iteration = 60
set server node_check_rate = 150
set server tcp_timeout = 6
set server submit_hosts = pbs_server
set server submit_hosts += pbs_oscar
set server submit_hosts += hpc00
set server submit_hosts += hpc
set server log_file_roll_depth = 20
set server log_keep_days = 30
set server next_job_number = 3466

The standard queue has 12 nodes with 8 cores on each node, and each node is set up in the nodes file like this:
usqhpc01 np=8 allnodes standard
usqhpc02 np=8 allnodes standard
usqhpc03 np=8 allnodes standard
usqhpc04 np=8 allnodes standard
There is only one node on the habeus queue, and its entry in the nodes file is:
habeus np=24 habeus

As far as I can tell, both queues are set up the same way. However, when a job is run using "#PBS -l nodes=1:ppn=2" on the standard queue, it gets deferred with the following error:

State: Idle  EState: Deferred
Creds:  user:youngr  group:ict  class:standard  qos:DEFAULT
WallTime: 00:00:00 of 00:05:00
SubmitTime: Thu Aug 11 13:25:25
  (Time Queued  Total: 3:33:22  Eligible: 00:00:00)

Total Tasks: 2

Req[0]  TaskCount: 2  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [standard][hpc02]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  NoResources  (cannot create reservation for job '3466' (intital reservation attempt)
Holds:    Defer  (hold reason:  NoResources)
PE:  2.00  StartPriority:  32
cannot select job 3466 for partition DEFAULT (job hold active)

But when the exact same job is run on the habeus queue, the job runs and completes correctly. If I request 1 core on 1 or more nodes on the standard queue, the same job also runs correctly. I have tried different parameters on the standard queue, but it still won't run a job using more than 1 core per node. The Maui log files don't provide any information as to what is happening.
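
For reference, a minimal test script along the following lines reproduces the request. This is only a sketch: the job name "ppn_test" and the body of the script are illustrative assumptions, not my actual job; only the queue, the nodes/ppn request and the 5 minute walltime are taken from the output above.

```shell
#!/bin/sh
# Illustrative reproduction script; the job name and the body are assumptions.
#PBS -N ppn_test
#PBS -q standard
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:05:00

# Torque sets PBS_O_WORKDIR and PBS_NODEFILE inside a job; fall back to safe
# defaults so the script also runs harmlessly outside the batch system.
cd "${PBS_O_WORKDIR:-.}"
NODEFILE="${PBS_NODEFILE:-/dev/null}"

# The node file holds one line per allocated core, so this should report 2
# when the ppn=2 request is honoured.
SLOTS=$(wc -l < "$NODEFILE" | tr -d ' ')
echo "host=$(uname -n) slots=$SLOTS"
```

Submitting this with qsub against the standard queue leaves the job deferred as shown above, while changing "-q standard" to "-q habeus" is the only difference in the case that works.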

Has anybody seen this problem before and fixed it, or can anyone provide some hints on how to fix it?

