[torqueusers] pbs -l procs=n syntax defaults to 1

Kevin Sutherland sutherland.kevinr at gmail.com
Mon Dec 16 11:42:58 MST 2013


Greetings,

I have posted this on both the torque and maui user lists, as I am unsure
whether the issue is in Maui or Torque (we had this same problem before we
ran Maui).

I am configuring clusters for engineering simulation use at my office. We
have two clusters: one with 12 nodes and 16 processors per node, and a
5-node cluster with 16 processors per node (except for a big-memory machine
with 32 processors).

I am only working on the 5-node cluster at this time, but the behavior
occurs on both clusters. When the procs syntax is used, the job is allocated
only 1 processor, even though procs is > 1. All nodes show free when issuing
qnodes or pbsnodes -a, and they list the appropriate number of CPUs defined
in the nodes file.

I have a simple test script:

#!/bin/bash

#PBS -S /bin/bash
#PBS -l nodes=2:ppn=8
#PBS -j oe

cat $PBS_NODEFILE

This script prints out:

pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet

Which is expected. When I change the PBS resource list to:

#PBS -l procs=32

I get the following:

pegasus.am1.mnet

The machine file created in /var/spool/torque/aux has only 1 entry for 1
process, even though I requested 32. We have a piece of simulation software
that REQUIRES the "-l procs=n" syntax to function on the cluster (ANSYS does
not plan to permit changes to this until Release 16 in 2015). We are trying
to use our cluster with ANSYS RSM with CFX and Fluent.
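
For completeness, here is the procs variant of the same test script, with an
extra sort/uniq line (added only for illustration) to count how many slots
each host receives in the machine file:

#!/bin/bash

#PBS -S /bin/bash
#PBS -l procs=32
#PBS -j oe

cat $PBS_NODEFILE
# Count slots assigned per host in the generated machine file
sort $PBS_NODEFILE | uniq -c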

We are running torque 4.2.6.1 and Maui 3.3.1.

My queue and server attributes are defined as follows:

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = titan1.am1.mnet
set server managers = kevin at titan1.am1.mnet
set server managers += root at titan1.am1.mnet
set server operators = kevin at titan1.am1.mnet
set server operators += root at titan1.am1.mnet
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = titan1.am1.mnet
set server next_job_number = 8
set server moab_array_compatible = True
set server nppcu = 1
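
(For anyone comparing against their own setup, the listing above can be
reproduced with qmgr's print subcommands:)

qmgr -c 'print server'
qmgr -c 'print queue batch'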

My torque nodes file is:

titan1.am1.mnet np=16 RAM64GB
titan2.am1.mnet np=16 RAM64GB
amdfl1.am1.mnet np=16 RAM64GB
amdfr1.am1.mnet np=16 RAM64GB
pegasus.am1.mnet np=32 RAM128GB
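
The np values can also be confirmed on the running server with pbsnodes
(standard usage; pegasus is just the example node here):

pbsnodes pegasus.am1.mnet | grep -E 'state|np ='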

Our maui.cfg file is:

# maui.cfg 3.3.1

SERVERHOST            titan1.am1.mnet
# primary admin must be first in list
ADMIN1                root kevin
ADMIN3              ALL

# Resource Manager Definition

RMCFG[TITAN1.AM1.MNET] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html


LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE

# Kevin's Modifications:

JOBNODEMATCHPOLICY EXACTNODE


# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR
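
On the Maui side, these commands may help show how the scheduler interpreted
the procs request (standard Maui diagnostics; the job id is a placeholder):

checkjob -v <jobid>
diagnose -j <jobid>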

Our MOM config file is:

$pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
$clienthost    10.0.0.10    # IP address of management node
$usecp        *:/home/kevin /home/kevin
$usecp        *:/home /home
$usecp        *:/root /root
$usecp        *:/home/mpi /home/mpi
$tmpdir        /home/mpi/tmp

I am finding it difficult to identify the configuration issue. I thought
this thread would help:

http://comments.gmane.org/gmane.comp.clustering.maui.user/2859

but in their examples the machine file is generated correctly and they are
battling memory allocations instead. I can't seem to get that far yet. Any
thoughts?

-- 
Kevin Sutherland
Simulations Specialist