[torqueusers] Cannot get more than 1 core on a node

Richard Young Richard.Young at usq.edu.au
Thu Aug 11 20:41:38 MDT 2011


Gus,

There is nothing that I know of in maui.cfg that would cause this problem, as configuration changes have been kept to the minimum needed to get things working. I have included my maui.cfg below. pbsnodes does report that all the nodes have the correct number of cores.

# maui.cfg 3.2.6p19

SERVERHOST              hpc00.usq.edu.au

# primary admin must be first in list
ADMIN1                  root

# Resource Manager Definition

RMCFG[hpc00.usq.edu.au] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL  00:00:10

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html


LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1 

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY  ON
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

# NODEACCESSPOLICY  DEDICATED
NODEACCESSPOLICY  SHARED
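
As a cross-check of the node definitions quoted further down, here is a minimal Python sketch (not part of Torque or Maui) that lists which nodes-file entries could satisfy a request such as nodes=1:ppn=2 for a given set of required features. The helper names `parse_nodes` and `eligible_nodes` are hypothetical. The sample entries and the feature list [standard][hpc02] are copied from this thread, and the sketch assumes the simple `name np=N property ...` nodes-file layout shown there.

```python
def parse_nodes(text):
    """Parse Torque nodes-file text into {name: (np, set_of_properties)}."""
    nodes = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, *fields = line.split()
        np = 1  # a node listed without np= counts as one processor
        props = set()
        for field in fields:
            if field.startswith("np="):
                np = int(field[3:])
            elif "=" not in field:
                props.add(field)  # bare words are node properties (features)
        nodes[name] = (np, props)
    return nodes


def eligible_nodes(nodes, ppn, features):
    """Return names of nodes with at least ppn cores and every requested feature."""
    wanted = set(features)
    return sorted(
        name for name, (np, props) in nodes.items()
        if np >= ppn and wanted <= props
    )


# Entries copied from the nodes file quoted in this thread.
NODES_FILE = """\
usqhpc01 np=8 allnodes standard
usqhpc02 np=8 allnodes standard
usqhpc03 np=8 allnodes standard
usqhpc04 np=8 allnodes standard
habeus np=24 habeus
"""

nodes = parse_nodes(NODES_FILE)
# Feature list as the queue default alone would request it.
print(eligible_nodes(nodes, ppn=2, features=["standard"]))
# Feature list as reported by checkjob for job 3466.
print(eligible_nodes(nodes, ppn=2, features=["standard", "hpc02"]))
```

With these sample entries the second call returns an empty list, because no listed node carries an hpc02 property. Whether that is what Maui is actually objecting to is an assumption; the checknode output for one of the standard nodes would show which features Maui really sees.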

---------------------------------------------------------------------
Richard A. Young
Division of ICT Services
Email: Richard.Young at usq.edu.au   Phone: (07) 46315557   
Mob:   0437544370          Fax:   (07) 46312798 
---------------------------------------------------------------------


-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gus Correa
Sent: Friday, 12 August 2011 2:42 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Cannot get more than 1 core on a node

Is there anything in your maui.cfg file
that may prevent the job from running?

Richard Young wrote:
> I am setting up a cluster environment using Torque 2.3.7 and Maui 3.2.6. These are older versions; however, they are part of a cluster management software package. On the cluster I have set up 2 queues as below:
> 
> #
> # Create and define queue standard
> #
> create queue standard
> set queue standard queue_type = Execution
> set queue standard Priority = 15
> set queue standard max_queuable = 60
> set queue standard max_running = 40
> set queue standard resources_max.walltime = 72:00:00
> set queue standard resources_default.neednodes = standard
> set queue standard resources_default.walltime = 24:00:00
> set queue standard max_user_run = 10
> set queue standard enabled = True
> set queue standard started = True
> #
> # Create and define queue habeus
> #
> create queue habeus
> set queue habeus queue_type = Execution
> set queue habeus Priority = 16
> set queue habeus max_queuable = 16
> set queue habeus max_running = 8
> set queue habeus resources_max.walltime = 72:00:00
> set queue habeus resources_default.neednodes = habeus
> set queue habeus resources_default.walltime = 24:00:00
> set queue habeus max_user_run = 4
> set queue habeus enabled = True
> set queue habeus started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = pbs_server
> set server acl_hosts += pbs_oscar
> set server acl_hosts += hpc
> set server acl_hosts += hpc00
> set server managers = root@hpc
> set server managers += root@hpc00
> set server operators = root@hpc
> set server operators += root@hpc00
> set server default_queue = standard
> set server log_events = 64
> set server mail_from = hpcadmin
> set server query_other_jobs = True
> set server scheduler_iteration = 60
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server submit_hosts = pbs_server
> set server submit_hosts += pbs_oscar
> set server submit_hosts += hpc00
> set server submit_hosts += hpc
> set server log_file_roll_depth = 20
> set server log_keep_days = 30
> set server next_job_number = 3466
> 
> The standard queue has 12 nodes with 8 cores each. Each node is set up in the nodes file like this:
> usqhpc01 np=8 allnodes standard
> usqhpc02 np=8 allnodes standard
> usqhpc03 np=8 allnodes standard
> usqhpc04 np=8 allnodes standard
> There is only one node on the habeus queue, and its entry in the nodes file is:
> habeus np=24 habeus
> 
> To me, both queues are set up the same; however, when a job is run using "#PBS -l nodes=1:ppn=2" on the standard queue, it gets deferred with the following error:
> 
> State: Idle  EState: Deferred
> Creds:  user:youngr  group:ict  class:standard  qos:DEFAULT
> WallTime: 00:00:00 of 00:05:00
> SubmitTime: Thu Aug 11 13:25:25
>   (Time Queued  Total: 3:33:22  Eligible: 00:00:00)
> 
> Total Tasks: 2
> 
> Req[0]  TaskCount: 2  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [standard][hpc02]
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
> 
> job is deferred.  Reason:  NoResources  (cannot create reservation for job '3466' (intital reservation attempt)
> )
> Holds:    Defer  (hold reason:  NoResources)
> PE:  2.00  StartPriority:  32
> cannot select job 3466 for partition DEFAULT (job hold active)
> 
> But when the exact same job is run on the habeus queue, it runs and completes correctly. If I select 1 core on 1 or more nodes, the same job also runs correctly on the standard queue. I have tried different parameters on the standard queue, but it still won't run a job using more than 1 core on a node. The Maui log files don't provide any information as to what is happening.
> 
> Has anybody seen this problem before and fixed it, or can anyone provide some hints on how to fix it?
> 
> Thank you
> ---------------------------------------------------------------------
> Richard A. Young
> Division of ICT Services
> HPC Support Officer
> University of Southern Queensland
> Toowoomba, Queensland 4350
> Australia 
> Email: Richard.Young at usq.edu.au   Phone: (07) 46315557   
> Mob:   0437544370          Fax:   (07) 46312798 
> ---------------------------------------------------------------------
> 
> 
> This email (including any attached files) is confidential and is for the
> intended recipient(s) only.  If you received this email by mistake,
> please, as a courtesy, tell the sender, then delete this email.
> 
> The views and opinions are the originator's and do not necessarily
> reflect those of the University of Southern Queensland.  Although all
> reasonable precautions were taken to ensure that this email contained no
> viruses at the time it was sent we accept no liability for any losses
> arising from its receipt.
> 
> The University of Southern Queensland is a registered provider of
> education with the Australian Government (CRICOS Institution Code No's.
> QLD 00244B / NSW 02225M)
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers