[torqueusers] Cannot get more than 1 core on a node

Simon Brennan simon.brennan at ersa.edu.au
Thu Aug 11 20:55:25 MDT 2011


Hi Richard

Out of interest, what happens if you try to submit a job to the standard
queue using something like:

#PBS -l ncpus=2

Do you get the same error?
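
For reference, a complete test job along those lines might look like the sketch below. The job name and the fallback to /dev/null are illustrative assumptions, not part of the original setup; the five-minute walltime simply mirrors the checkjob output quoted further down. The script only reports where it ran and how many slots it was given, so it is safe to use as a probe.

```shell
#!/bin/bash
# Hypothetical probe job: the job name below is an illustrative assumption.
#PBS -N ncpus-test
#PBS -q standard
#PBS -l ncpus=2
#PBS -l walltime=00:05:00

# Torque sets PBS_NODEFILE inside a running job (one line per allocated slot).
# Fall back to /dev/null so the script also runs harmlessly outside the batch system.
nodefile="${PBS_NODEFILE:-/dev/null}"

# Count allocated slots; the arithmetic expansion strips any padding from wc.
slots=$(( $(wc -l < "$nodefile") ))

echo "host: $(uname -n)"
echo "slots allocated: ${slots}"
```

Saved as, say, ncpus-test.sh (a hypothetical file name), it can be submitted with qsub and then inspected with checkjob to compare against the nodes=1:ppn=2 case.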

Regards

Simon Brennan

System Administrator
eResearchSA
University of Adelaide



On 08/11/2011 04:40 PM, Richard Young wrote:
> I am setting up a cluster environment using Torque 2.3.7 and Maui 3.2.6. These are older versions; however, they are part of the cluster management software package. On the cluster I have set up 2 queues as below:
>
> #
> # Create and define queue standard
> #
> create queue standard
> set queue standard queue_type = Execution
> set queue standard Priority = 15
> set queue standard max_queuable = 60
> set queue standard max_running = 40
> set queue standard resources_max.walltime = 72:00:00
> set queue standard resources_default.neednodes = standard
> set queue standard resources_default.walltime = 24:00:00
> set queue standard max_user_run = 10
> set queue standard enabled = True
> set queue standard started = True
> #
> # Create and define queue habeus
> #
> create queue habeus
> set queue habeus queue_type = Execution
> set queue habeus Priority = 16
> set queue habeus max_queuable = 16
> set queue habeus max_running = 8
> set queue habeus resources_max.walltime = 72:00:00
> set queue habeus resources_default.neednodes = habeus
> set queue habeus resources_default.walltime = 24:00:00
> set queue habeus max_user_run = 4
> set queue habeus enabled = True
> set queue habeus started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = pbs_server
> set server acl_hosts += pbs_oscar
> set server acl_hosts += hpc
> set server acl_hosts += hpc00
> set server managers = root@hpc
> set server managers += root@hpc00
> set server operators = root@hpc
> set server operators += root@hpc00
> set server default_queue = standard
> set server log_events = 64
> set server mail_from = hpcadmin
> set server query_other_jobs = True
> set server scheduler_iteration = 60
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server submit_hosts = pbs_server
> set server submit_hosts += pbs_oscar
> set server submit_hosts += hpc00
> set server submit_hosts += hpc
> set server log_file_roll_depth = 20
> set server log_keep_days = 30
> set server next_job_number = 3466
>
> The standard queue has 12 nodes with 8 cores on each node. Each node is set up in the nodes file like this:
> usqhpc01 np=8 allnodes standard
> usqhpc02 np=8 allnodes standard
> usqhpc03 np=8 allnodes standard
> usqhpc04 np=8 allnodes standard
> There is only one node on the habeus queue, and its entry in the nodes file is:
> habeus np=24 habeus
>
> To me both queues are set up the same way. However, when a job is run using "#PBS -l nodes=1:ppn=2" on the standard queue, it gets deferred with the following error:
>
> State: Idle  EState: Deferred
> Creds:  user:youngr  group:ict  class:standard  qos:DEFAULT
> WallTime: 00:00:00 of 00:05:00
> SubmitTime: Thu Aug 11 13:25:25
>    (Time Queued  Total: 3:33:22  Eligible: 00:00:00)
>
> Total Tasks: 2
>
> Req[0]  TaskCount: 2  Partition: ALL
> Network: [NONE]  Memory>= 0  Disk>= 0  Swap>= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [standard][hpc02]
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
>
> job is deferred.  Reason:  NoResources  (cannot create reservation for job '3466' (intital reservation attempt)
> )
> Holds:    Defer  (hold reason:  NoResources)
> PE:  2.00  StartPriority:  32
> cannot select job 3466 for partition DEFAULT (job hold active)
>
> But when the exact same job is run on the habeus queue, it runs and completes correctly. If I select 1 core on 1 or more nodes, the same job also runs correctly. I have tried different parameters on the standard queue, but it still won't run a job using more than 1 core per node. The Maui log files don't provide any information as to what is happening.
>
> Has anybody seen this problem before and fixed it, or can anyone provide some hints on how to fix it?
>
> Thank you
> ---------------------------------------------------------------------
> Richard A. Young
> Division of ICT Services
> HPC Support Officer
> University of Southern Queensland
> Toowoomba, Queensland 4350
> Australia
> Email: Richard.Young at usq.edu.au   Phone: (07) 46315557
> Mob:   0437544370          Fax:   (07) 46312798
> ---------------------------------------------------------------------

