[torqueusers] PBS job failure when trying to run an MPI program on multiple nodes

Garrick Staples garrick at clusterresources.com
Tue Jun 27 08:56:07 MDT 2006


TORQUE version?

On Mon, Jun 26, 2006 at 04:06:31PM -0400, Christopher J. Tengi alleged:
>    I've asked about this on the rocks-discuss list, but nobody there 
> seems to know exactly what is going on or why.  Below is my original 
> message to that list.  After sending the message, I discovered that my 
> MAUI configuration was optimizing my request and assigning up to 4 
> processors on the same node, rather than splitting my job across 4 
> nodes, as long as there was one node with 4 processors available.  
> Perhaps somebody on this list can shed more light on the subject....
> 
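
If the goal is for a nodes=N request to land on N distinct hosts rather than
letting Maui pack the processors onto one free machine, the knob to look at
is probably JOBNODEMATCHPOLICY.  A maui.cfg sketch (parameter name is from
the Maui docs; I haven't checked it against your version):

========
# maui.cfg -- make nodes=2:ppn=2 mean two distinct hosts with 2 procs each,
# instead of packing all 4 procs onto one free 4-way node
JOBNODEMATCHPOLICY  EXACTNODE
========

Restart maui after changing the file so it rereads the config.
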
>   I am running Rocks 4.1 on a bunch of SunFire X4100s (x86_64) using 
> the PBS roll instead of SGE.  I have a very simple "hello world" type of 
> MPI program I'm using for testing, but my tests are failing when I try 
> to use multiple processors on multiple nodes.  Here is the PBS file:
> 
> ========
> :

What's this?  Should be something like #!/bin/sh (see the sketch just after
the script below).

> #
> #PBS -l walltime=10:00,nodes=2:ppn=2
> #
> # merge STDERR into STDOUT file
> #PBS -j oe

I assume the problem goes away when you don't use -j?

> #
> # sends mail if the process aborts, when it begins, and
> # when it ends (abe)
> #PBS -m abe
> #PBS -M tengi at CS.Princeton.EDU
> #
> cd $PBS_O_WORKDIR
> mpiexec ./mpitest
> ========
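
About the lone ':' on the first line -- if that was meant to stand in for the
interpreter line, here is a minimal sketch of the same script with an explicit
shebang (directives copied from the script above; the mail address is left in
the obfuscated form the archive shows):

========
#!/bin/sh
#PBS -l walltime=10:00,nodes=2:ppn=2
# merge STDERR into STDOUT file
#PBS -j oe
# send mail when the job aborts, begins, and ends (abe)
#PBS -m abe
#PBS -M tengi at CS.Princeton.EDU

cd $PBS_O_WORKDIR
mpiexec ./mpitest
========

That won't explain the spool error by itself, but it removes one variable.
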
> 
>   The error email I get is attached to this message, but appears to 
> boil down to:
> 
>       /opt/torque/spool/43.c2.cs.pr.OU: No such file or directory
> 
> Note that this job works fine with up to 4 processors on 1 node, and 
> works fine with 4 nodes with 1 processor per node.  However, if I try 
> anything with more than 1 node and more than 1 processor per node, I 
> get an error like the one above.  I just discovered that I also get a 
> similar error with more than 4 nodes, even if I specify only a single 
> processor per node.  I thought it might be related to directory modes on 
> the spool directory, but a cluster-forked 'ls' command returns output 
> like this for every compute node:
> 
> ========
> drwxr-xr-x  12 root root 4096 Oct 19  2005 /opt
> drwxr-xr-x  18 root root 4096 Jun 15 11:45 /opt/torque
> drwxrwxrwt   2 root root 4096 Jun 16 12:31 /opt/torque/spool
> ========
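
(For anyone reproducing the check above: on Rocks that is presumably something
like the following, run from the frontend.)

========
# list the spool path and its parents on every compute node
cluster-fork 'ls -ld /opt /opt/torque /opt/torque/spool'
========
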
> 
>   One odd thing I see is that only 2 of the compute nodes (10 and 11 - 
> the top 2 reported "up" by 'pbsnodes -a') have spool directory 
> timestamps from today.  I don't know if or how that matters.  BTW, here is the 
> 'pbsnodes -a' output for compute-0-11.  The 11 nodes before it have 
> similar information:
> 
> ========
> compute-0-11.local
>    state = free
>    np = 4
>    ntype = cluster
>    status = opsys=linux,uname=Linux compute-0-11.local 2.6.9-22.ELsmp 
> #1 SMP Sat Oct 8 21:32:36 BST 2005 
> x86_64,sessions=?0,nsessions=?0,nusers=0,idletime=89152,totmem=16239556kb,availmem=16126752kb,physmem=8046452kb,ncpus=4,loadave=0.00,netload=8891140982,state=free,jobs=?0,rectime=1150475853 
> 
> ========
> 
>   So, has anybody seen this before?  Any ideas as to what I may be 
> doing wrong?  Do I need to change anything from the default PBS or MAUI 
> configurations?  It looks like I can only use a total of 4 processors, 
> and they either all need to be on 1 node or be spread 1 per node across 
> 4 nodes.  If 
> it was just a processor count limit, I would have expected 2 nodes with 
> 2 processors each to work.  In any case, if it was a resource limit 
> problem, I would have expected a different failure scenario.  Note that 
> a Google search for "Unable to copy file /opt/torque/spool/" came up 
> with only one hit on the torqueusers mailing list and there was no 
> resolution.  Should I be sending my query there instead of here?
> 
>               Thanks,
>                           /Chris

> Date: Fri, 16 Jun 2006 12:10:49 -0400 (EDT)
> From: adm at c2.cs.princeton.edu (root)
> Subject: PBS JOB 43.c2.cs.princeton.edu
> To: tengi at CS.Princeton.EDU
> 
> PBS Job Id: 43.c2.cs.princeton.edu
> Job Name:   mpitest2.pbs
> An error has occurred processing your job, see below.
> Post job file processing error; job 43.c2.cs.princeton.edu on host compute-0-11.local/1+compute-0-11.local/0+compute-0-10.local/1+compute-0-10.local/0
> 
> Unable to copy file /opt/torque/spool/43.c2.cs.pr.OU to atengi at c2.cs.princeton.edu:/u/atengi/cluster/mpitest/mpitest2.pbs.o43
> >>> error from copy
> /opt/torque/spool/43.c2.cs.pr.OU: No such file or directory
> >>> end error output

I've done a lot of work in the data staging code, but that was back in
2.0.0p5.  I don't recall seeing this particular problem before.  I'm guessing
this is coming from a duplicated copy request.  Is the file actually
showing up at the destination?  Are you seeing failed rcp/scp's in your
syslog on c2.cs.princeton.edu?
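
One way to narrow that down is to redo the copy by hand from one of the MOM
nodes as the job owner.  A rough sketch (hostname and destination path are
taken from the error mail; the test file name is made up, and your MOMs may
be configured to use rcp instead of scp):

========
# on compute-0-11
su - atengi          # become the job owner named in the error mail
echo test > /tmp/copytest
scp /tmp/copytest c2.cs.princeton.edu:/u/atengi/cluster/mpitest/
========

If that prompts for a password or fails outright, the post-job output copy
will fail the same way.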

Increase the loglevel on the MOMs and look at the log file.
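
A sketch of what that might look like, assuming the Rocks PBS roll keeps
TORQUE under /opt/torque and installs the usual pbs_mom init script (paths
not verified against your install):

========
# on each compute node (or via cluster-fork)
echo '$loglevel 7' >> /opt/torque/mom_priv/config
/etc/init.d/pbs_mom restart
# then watch the MOM log while a failing job runs; log files are named YYYYMMDD
tail -f /opt/torque/mom_logs/$(date +%Y%m%d)
========

Also, if the home directories are NFS-mounted on the compute nodes, a $usecp
line in that same config file lets the MOM copy output locally instead of
going through rcp/scp, which sidesteps a lot of these copy errors.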


