[torqueusers] PBS job failure when trying to run an MPI program on
multiple nodes
Christopher J. Tengi
tengi at CS.Princeton.EDU
Mon Jun 26 14:06:31 MDT 2006
I've asked about this on the rocks-discuss list, but nobody there
seems to know exactly what is going on or why. Below is my original
message to that list. After sending the message, I discovered that my
MAUI configuration was optimizing my request and assigning up to 4
processors on the same node, rather than splitting my job up on 4
nodes, as long as there was one node with 4 processors available.
Perhaps somebody on this list can shed more light on the subject....
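For anyone wanting to confirm that kind of packing from inside a job, here is a minimal sketch. It assumes the job runs under TORQUE, which sets $PBS_NODEFILE to a file with one line per allocated processor slot. The helper name count_slots is hypothetical and not part of TORQUE or MAUI.

```shell
#!/bin/sh
# Print "<slots> <hostname>" for each host listed in a PBS-style node file.
# count_slots is a hypothetical helper name used only for this sketch.
count_slots() {
    # $1: path to a node file with one hostname per allocated slot
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

# Inside a job script TORQUE sets PBS_NODEFILE; outside a job this does nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    count_slots "$PBS_NODEFILE"
fi
```

Placed before the mpiexec line of a job script, this would print a single line with a count of 4 if all four slots landed on one node, and several lines if the job was actually spread across nodes.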
I am running Rocks 4.1 on a bunch of SunFire X4100s (x86_64) using
the PBS roll instead of SGE. I have a very simple "hello world" type of
MPI program I'm using for testing, but my tests are failing when I try
to use multiple processors on multiple nodes. Here is the PBS file:
========
:
#
#PBS -l walltime=10:00,nodes=2:ppn=2
#
# merge STDERR into STDOUT file
#PBS -j oe
#
# sends mail if the process aborts, when it begins, and
# when it ends (abe)
#PBS -m abe
#PBS -M tengi at CS.Princeton.EDU
#
cd $PBS_O_WORKDIR
mpiexec ./mpitest
========
The error email I get is attached to this message, but it appears to
boil down to:
/opt/torque/spool/43.c2.cs.pr.OU: No such file or directory
Note that this job works fine with up to 4 processors on 1 node, and
works fine on 4 nodes with 1 processor per node. However, if I try
anything with more than 1 node and more than 1 processor per node, I
get an error like the one above. I just discovered that I also get a
similar error with more than 4 nodes, even if I specify only a single
processor per node. I thought it might be related to directory modes on
the spool directory, but a cluster-forked 'ls' command returns output
like this for every compute node:
========
drwxr-xr-x 12 root root 4096 Oct 19 2005 /opt
drwxr-xr-x 18 root root 4096 Jun 15 11:45 /opt/torque
drwxrwxrwt 2 root root 4096 Jun 16 12:31 /opt/torque/spool
========
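The same mode check can be scripted rather than read by eye. This is only a sketch: it assumes GNU coreutils stat (for the -c option), and spool_mode_ok is a hypothetical helper name. The mode it looks for, 1777, is the numeric form of the drwxrwxrwt shown in the listing.

```shell
#!/bin/sh
# Return 0 if the directory exists and has mode 1777 (drwxrwxrwt).
# Assumes GNU coreutils stat; spool_mode_ok is a hypothetical helper name.
spool_mode_ok() {
    [ -d "$1" ] && [ "$(stat -c '%a' "$1")" = "1777" ]
}

# Default to the spool path from the listing; pass another path as $1 to override.
dir=${1:-/opt/torque/spool}
if spool_mode_ok "$dir"; then
    echo "$dir: mode 1777, ok"
else
    echo "$dir: missing or not mode 1777"
fi
```

Run on each compute node (for example through the same cluster-fork mechanism), it reduces the comparison to one line per node.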
One odd thing I see is that only 2 of the compute nodes (10 and 11 -
the top 2 reported "up" by 'pbsnodes -a') have spool directory
timestamps today. I don't know if/how that matters. BTW, here is the
'pbsnodes -a' output for compute-0-11. The 11 nodes before it have
similar information:
========
compute-0-11.local
     state = free
     np = 4
     ntype = cluster
     status = opsys=linux,uname=Linux compute-0-11.local 2.6.9-22.ELsmp #1 SMP Sat Oct 8 21:32:36 BST 2005 x86_64,sessions=?0,nsessions=?0,nusers=0,idletime=89152,totmem=16239556kb,availmem=16126752kb,physmem=8046452kb,ncpus=4,loadave=0.00,netload=8891140982,state=free,jobs=?0,rectime=1150475853
========
So, has anybody seen this before? Any ideas as to what I may be
doing wrong? Do I need to change anything from the default PBS or MAUI
configurations? It looks like I can only use a total of 4 processors
and they either need to only be on 1 node or 1 per node on 4 nodes. If
it was just a processor count limit, I would have expected 2 nodes with
2 processors each to work. In any case, if it was a resource limit
problem, I would have expected a different failure scenario. Note that
a Google search for "Unable to copy file /opt/torque/spool/" came up
with only one hit on the torqueusers mailing list and there was no
resolution. Should I be sending my query there instead of here?
Thanks,
/Chris
-------------- next part --------------
An embedded message was scrubbed...
From: adm at c2.cs.princeton.edu (root)
Subject: PBS JOB 43.c2.cs.princeton.edu
Date: Fri, 16 Jun 2006 12:10:49 -0400 (EDT)
Size: 1925
Url: http://www.supercluster.org/pipermail/torqueusers/attachments/20060626/db22e3b1/PBSJOB43.c2.cs.princeton-0001.mht