[torqueusers] FW: Cannot get more than 1 core on a node

Richard Young Richard.Young at usq.edu.au
Sun Aug 14 17:20:20 MDT 2011


Simon
Changing to "#PBS -l ncpus=2" does allow the job to run; however, the job runs on only 1 core on 1 node. This happens on both the standard and habeus queues. By contrast, "#PBS -l nodes=1:ppn=2" runs on the habeus queue but not on the standard queue. The test script I am using includes the line "/bin/cat $PBS_NODEFILE", which tells me which nodes the job is running on.
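
For reference, a minimal sketch of that kind of test script, assuming a POSIX shell and that Torque writes one line to $PBS_NODEFILE per allocated core. The function name summarize_nodefile is made up for illustration, and the walltime is only an example value. Instead of just printing the file, it counts how many slots each host received:

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:05:00

# Hypothetical helper: print one line per host with the number of
# slots (cores) listed for it in the given node file.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " core(s)" }'
}

# Only report when running under the batch system, where
# PBS_NODEFILE points at the list of allocated slots.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If nodes=1:ppn=2 is honoured, the output should be a single host with 2 core(s).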

---------------------------------------------------------------------
Richard A. Young
Division of ICT Services
Email: Richard.Young at usq.edu.au   Phone: (07) 46315557   
Mob:   0437544370          Fax:   (07) 46312798 
---------------------------------------------------------------------


-----Original Message-----
From: Simon Brennan [mailto:simon.brennan at ersa.edu.au] 
Sent: Friday, 12 August 2011 12:55 PM
To: Torque Users Mailing List
Cc: Richard Young
Subject: Re: [torqueusers] Cannot get more than 1 core on a node

Hi Richard

Out of interest, what happens if you try to submit a job to the standard 
queue using something like:

#PBS -l ncpus=2

Do you get the same error?

Regards

Simon Brennan

System Administrator
eResearchSA
University of Adelaide



On 08/11/2011 04:40 PM, Richard Young wrote:
> I am setting up a cluster environment using Torque 2.3.7 and Maui 3.2.6. These are older versions; however, they are part of a cluster management software package. On the cluster I have set up 2 queues as below:
>
> #
> # Create and define queue standard
> #
> create queue standard
> set queue standard queue_type = Execution
> set queue standard Priority = 15
> set queue standard max_queuable = 60
> set queue standard max_running = 40
> set queue standard resources_max.walltime = 72:00:00
> set queue standard resources_default.neednodes = standard
> set queue standard resources_default.walltime = 24:00:00
> set queue standard max_user_run = 10
> set queue standard enabled = True
> set queue standard started = True
> #
> # Create and define queue habeus
> #
> create queue habeus
> set queue habeus queue_type = Execution
> set queue habeus Priority = 16
> set queue habeus max_queuable = 16
> set queue habeus max_running = 8
> set queue habeus resources_max.walltime = 72:00:00
> set queue habeus resources_default.neednodes = habeus
> set queue habeus resources_default.walltime = 24:00:00
> set queue habeus max_user_run = 4
> set queue habeus enabled = True
> set queue habeus started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = pbs_server
> set server acl_hosts += pbs_oscar
> set server acl_hosts += hpc
> set server acl_hosts += hpc00
> set server managers = root at hpc
> set server managers += root at hpc00
> set server operators = root at hpc
> set server operators += root at hpc00
> set server default_queue = standard
> set server log_events = 64
> set server mail_from = hpcadmin
> set server query_other_jobs = True
> set server scheduler_iteration = 60
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server submit_hosts = pbs_server
> set server submit_hosts += pbs_oscar
> set server submit_hosts += hpc00
> set server submit_hosts += hpc
> set server log_file_roll_depth = 20
> set server log_keep_days = 30
> set server next_job_number = 3466
>
> The nodes of the standard queue are set up as shown below; there are 12 nodes with 8 cores on each node:
> usqhpc01 np=8 allnodes standard
> usqhpc02 np=8 allnodes standard
> usqhpc03 np=8 allnodes standard
> usqhpc04 np=8 allnodes standard
> There is only one node on the habeus queue, and its entry in the nodes file is:
> habeus np=24 habeus
>
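
As a cross-check of the nodes file entries above, a small sketch that totals the np= values for every node carrying a given property. The function name cores_for_property and the default path /var/spool/torque/server_priv/nodes are assumptions for illustration; adjust the path to the local Torque installation:

```shell
#!/bin/sh
# Hypothetical helper: count nodes and sum np= values for all entries
# in a Torque nodes file that carry the given property.
# Usage: cores_for_property <nodes-file> <property>
cores_for_property() {
    awk -v prop="$2" '
        NF == 0 || $1 ~ /^#/ { next }       # skip blank and comment lines
        {
            np = 1; has = 0                 # np defaults to 1 when omitted
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^np=/) { np = substr($i, 4) + 0 }
                else if ($i == prop) { has = 1 }
            }
            if (has) { nodes++; cores += np }
        }
        END { printf "%d node(s), %d core(s)\n", nodes, cores }
    ' "$1"
}

# Assumed default location of the nodes file; override with NODES_FILE.
NODES_FILE="${NODES_FILE:-/var/spool/torque/server_priv/nodes}"
if [ -r "$NODES_FILE" ]; then
    cores_for_property "$NODES_FILE" standard
    cores_for_property "$NODES_FILE" habeus
fi
```

Against the five entries quoted above, this reports 4 node(s), 32 core(s) for standard and 1 node(s), 24 core(s) for habeus.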
> To me both queues are set up the same; however, when a job is run using "#PBS -l nodes=1:ppn=2" on the standard queue, it gets deferred with the following error:
>
> State: Idle  EState: Deferred
> Creds:  user:youngr  group:ict  class:standard  qos:DEFAULT
> WallTime: 00:00:00 of 00:05:00
> SubmitTime: Thu Aug 11 13:25:25
>    (Time Queued  Total: 3:33:22  Eligible: 00:00:00)
>
> Total Tasks: 2
>
> Req[0]  TaskCount: 2  Partition: ALL
> Network: [NONE]  Memory>= 0  Disk>= 0  Swap>= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [standard][hpc02]
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
>
> job is deferred.  Reason:  NoResources  (cannot create reservation for job '3466' (intital reservation attempt)
> )
> Holds:    Defer  (hold reason:  NoResources)
> PE:  2.00  StartPriority:  32
> cannot select job 3466 for partition DEFAULT (job hold active)
>
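
The Features line in the job status output above shows which node properties the request carries (here standard and hpc02). A minimal sketch that lists those features one per line, so they can be compared with the properties in the nodes file; job_features is a hypothetical helper name, and it assumes the status output is supplied on standard input:

```shell
#!/bin/sh
# Hypothetical helper: read job status text on stdin and print each
# entry from the "Features: [a][b]" line on its own line.
job_features() {
    sed -n 's/.*Features:[[:space:]]*//p' \
        | tr -d '[' \
        | tr ']' '\n' \
        | sed '/^[[:space:]]*$/d'
}

# Example (assuming the output above came from Maui's checkjob):
#   checkjob 3466 | job_features
```

Any feature printed here that is missing from a node's entry in the nodes file would be worth checking first.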
> But when the exact same job is run on the habeus queue, it runs and completes correctly. If I select 1 core on 1 or more nodes, the same job also runs correctly. I have tried different parameters on the standard queue, but it still won't run a job using more than 1 core. The Maui log files don't provide any information as to what is happening.
>
> Has anybody seen this problem before and fixed it, or can anyone provide some hints on how to fix it?
>
> Thank you
> ---------------------------------------------------------------------
> Richard A. Young
> Division of ICT Services
> HPC Support Officer
> University of Southern Queensland
> Toowoomba, Queensland 4350
> Australia
> Email: Richard.Young at usq.edu.au   Phone: (07) 46315557
> Mob:   0437544370          Fax:   (07) 46312798
> ---------------------------------------------------------------------
>
>
> This email (including any attached files) is confidential and is for the
> intended recipient(s) only.  If you received this email by mistake,
> please, as a courtesy, tell the sender, then delete this email.
>
> The views and opinions are the originator's and do not necessarily
> reflect those of the University of Southern Queensland.  Although all
> reasonable precautions were taken to ensure that this email contained no
> viruses at the time it was sent we accept no liability for any losses
> arising from its receipt.
>
> The University of Southern Queensland is a registered provider of
> education with the Australian Government (CRICOS Institution Code No's.
> QLD 00244B / NSW 02225M)
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
