[torqueusers] problems with PBS_NODEFILE and openmpi

Juergen Kabelitz jkabelitz at sysgen.de
Mon Feb 18 05:13:36 MST 2008


The system is a cluster with 1 headnode and 32 cluster nodes.

The OS is SuSE 10.2

I have installed torque-2.2.1, configured with:
 ./configure --prefix=/usr/local

The pbs_server and pbs_sched daemons are running on the headnode m01.

The node file:
n02 np=4 shared
n03 np=4 shared
n04 np=4 shared
n05 np=4
n06 np=4
n08 np=4
n09 np=4
n10 np=4
n11 np=4
n12 np=4
n13 np=4
n14 np=4
n15 np=4
n16 np=4
n17 np=4
n18 np=4
n19 np=4
n20 np=4
n21 np=4
n22 np=4
n23 np=4
n24 np=4
n25 np=4
n26 np=4
n27 np=4
n28 np=4
n29 np=4
n30 np=4 cluster
n31 np=4 cluster
n32 np=4 cluster
n01 np=4
n07 np=4
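
As a first sanity check on a nodes file like the one above, I usually look at what pbs_server actually thinks of the nodes (a diagnostic fragment; it has to run on a host that can reach pbs_server):

```shell
# Show the state of every node known to pbs_server.
pbsnodes -a

# List only nodes that are down, offline, or in an unknown state;
# nodes shown here will never receive jobs.
pbsnodes -l
```

If some of the 32 nodes show up as down, a nodes=12 request can fail even though the nodes file looks complete.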

My server configuration is:
m01:/var/spool/torque/server_priv # qmgr -c 'p s'
# Create queues and set their attributes.
# Create and define queue batch
create queue batch
set queue batch queue_type = Execution
set queue batch resources_max.nodect = 128
set queue batch resources_min.nodect = 1
set queue batch resources_default.neednodes = 128
set queue batch resources_default.nodes = 128
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 999999
set queue batch enabled = True
set queue batch started = True
# Create and define queue cluster
create queue cluster
set queue cluster queue_type = Execution
set queue cluster max_running = 10
set queue cluster resources_max.nodect = 32
set queue cluster resources_max.nodes = 32
set queue cluster resources_min.nodect = 1
set queue cluster resources_min.nodes = 1
set queue cluster enabled = True
set queue cluster started = True
# Set server attributes.
set server scheduling = True
set server max_user_run = 128
set server managers = root@m01.local
set server operators = root@m01.local
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 999999
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server pbs_version = 2.2.1
set server keep_completed = 300
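
One thing that stands out when reading the dump above: queue batch has resources_default.nodes = 128 and resources_default.neednodes = 128, but the nodes file only defines 32 nodes. Assuming standard Torque semantics (defaults apply when a job supplies no node request, and neednodes is matched against node properties by the scheduler), a sketch of a less surprising configuration would be:

```shell
# Sketch, not a verified fix: drop the 128-node defaults, which exceed
# the 32 nodes actually defined in server_priv/nodes.
qmgr -c "unset queue batch resources_default.neednodes"
qmgr -c "set queue batch resources_default.nodes = 1"
```

Whether this is the cause of the empty node file is an open question, but the 128 values cannot be satisfied by this cluster.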

My problem is:
when I submit a job to the system with Open MPI, I get only one node:

sysgen at m01:~> qsub -I -l nodes=12
qsub: waiting for job 188.m01.local to start
qsub: job 188.m01.local ready

sysgen at n02:~> echo $PBS_NODEFILE
sysgen at n02:~> cat /var/spool/torque/aux//188.m01.local
sysgen at n02:~>

When I start the following job
sysgen at m01:~> qsub -I -l nodes=n02+n03+n04
qsub: waiting for job 189.m01.local to start
qsub: job 189.m01.local ready

sysgen at n02:~> echo $PBS_NODEFILE
sysgen at n02:~> cat /var/spool/torque/aux//189.m01.local
sysgen at n02:~>

it works.
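
For reference, this is roughly how the node file is consumed once it is populated. Torque writes one hostname line per allocated processor, so a mock file for nodes=3:ppn=4 has 12 lines (the mpirun line is a sketch; ./my_app is a placeholder for the MPI program):

```shell
# Mock node file standing in for $PBS_NODEFILE: one line per allocated
# processor, here 3 nodes x 4 processors = 12 lines.
PBS_NODEFILE=$(mktemp)
for host in n02 n03 n04; do
  for cpu in 1 2 3 4; do echo "$host"; done
done > "$PBS_NODEFILE"

# Derive the process count the way an Open MPI job script typically would.
NP=$(wc -l < "$PBS_NODEFILE")
echo "$NP"   # line count of the node file (12 here)

# mpirun -np "$NP" -hostfile "$PBS_NODEFILE" ./my_app   # ./my_app is a placeholder

rm -f "$PBS_NODEFILE"
```

(An Open MPI built with Torque's tm support picks up the allocation on its own, so the -hostfile argument may be unnecessary there; the wc -l count is the portable fallback.)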

Where is my mistake? Or is there something I'm not understanding correctly?

J. Kabelitz

