[torqueusers] PBS job failure when trying to run an MPI program on multiple nodes

Christopher J. Tengi tengi at CS.Princeton.EDU
Tue Jun 27 13:49:54 MDT 2006


Garrick,

on 06/27/2006 10:56 AM Garrick Staples said the following:
> TORQUE version?
>
>   

torque-2.0.0p7-1 is what I get back from an rpm query.  'qsub --version' 
returns 2.0.0p7

> On Mon, Jun 26, 2006 at 04:06:31PM -0400, Christopher J. Tengi alleged:
>   
>>    I've asked about this on the rocks-discuss list, but nobody there 
>> seems to know exactly what is going on or why.  Below is my original 
>> message to that list.  After sending the message, I discovered that my 
>> MAUI configuration was optimizing my request and assigning up to 4 
>> processors on the same node, rather than splitting my job up on 4 
>> nodes,  as long as there was one node with 4 processors available.  
>> Perhaps somebody on this list can shed more light on the subject....
>>
>>   I am running Rocks 4.1 on a bunch of SunFire X4100s (x86_64) using 
>> the PBS roll instead of SGE.  I have a very simple "hello world" type of 
>> MPI program I'm using for testing, but my tests are failing when I try 
>> to use multiple processors on multiple nodes.  Here is the PBS file:
>>
>> ========
>> :
>>     
>
> What's this?  Should be something like #!/bin/sh
>
>   

Starting shell scripts with a colon is an ancient practice.  The colon 
is a "do nothing - successfully" command in a number of shells.  It 
appeared in the example I based my script on (before I trimmed it down 
severely) so I just left it in.
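
To illustrate (not my actual script, just a minimal sketch), a script
that starts that way runs fine on its own:

:
# the first line is the no-op builtin; with no #! line, whichever
# shell invokes the script ends up interpreting it
echo "made it past the colon"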

>> #
>> #PBS -l walltime=10:00,nodes=2:ppn=2
>> #
>> # merge STDERR into STDOUT file
>> #PBS -j oe
>>     
>
> I assume the problem goes away when you don't use -j?
>
>   

Nope.  The problem just changes.  Now I get 2 "unable to copy" errors.  
One is for the .OU file and the other is for the .ER file.

>> #
>> # sends mail if the process aborts, when it begins, and
>> # when it ends (abe)
>> #PBS -m abe
>> #PBS -M tengi@CS.Princeton.EDU
>> #
>> cd $PBS_O_WORKDIR
>> mpiexec ./mpitest
>> ========
>>
>>   The error EMail I get is attached to this message, but appears to 
>> boil down to:
>>
>>       /opt/torque/spool/43.c2.cs.pr.OU: No such file or directory
>>
>> Note that this job works fine with up to 4 processors on 1 node, and 
>> works fine with 4 nodes with 1 processor per node.  However, if I try 
>> anything with more than 1 node and more than one processor per node, I 
>> get an error like the one above.  I just discovered that I also get a 
>> similar error with more than 4 nodes, even if I specify only a single 
>> processor per node.  I thought it might be related to directory modes on 
>> the spool directory, but a cluster-forked 'ls' command returns output 
>> like this for every compute node:
>>
>> ========
>> drwxr-xr-x  12 root root 4096 Oct 19  2005 /opt
>> drwxr-xr-x  18 root root 4096 Jun 15 11:45 /opt/torque
>> drwxrwxrwt   2 root root 4096 Jun 16 12:31 /opt/torque/spool
>> ========
>>
>>   One odd thing I see is that only 2 of the compute nodes (10 and 11 - 
>> the top 2 reported "up" by 'pbsnodes -a') have spool directory 
>> timestamps today.  I don't know if/how that matters.  BTW, here is the 
>> 'pbsnodes -a' output for compute-0-11.  The 11 nodes before it have 
>> similar information:
>>
>> ========
>> compute-0-11.local
>>    state = free
>>    np = 4
>>    ntype = cluster
>>    status = opsys=linux,uname=Linux compute-0-11.local 2.6.9-22.ELsmp 
>> #1 SMP Sat Oct 8 21:32:36 BST 2005 
>> x86_64,sessions=?0,nsessions=?0,nusers=0,idletime=89152,totmem=16239556kb,availmem=16126752kb,physmem=8046452kb,ncpus=4,loadave=0.00,netload=8891140982,state=free,jobs=?0,rectime=1150475853 
>>
>> ========
>>
>>   So, has anybody seen this before?  Any ideas as to what I may be 
>> doing wrong?  Do I need to change anything from the default PBS or MAUI 
>> configurations?  It looks like I can only use a total of 4 processors 
>> and they either need to only be on 1 node or 1 per node on 4 nodes.  If 
>> it was just a processor count limit, I would have expected 2 nodes with 
>> 2 processors each to work.  In any case, if it was a resource limit 
>> problem, I would have expected a different failure scenario.  Note that 
>> a google search for "Unable to copy file /opt/torque/spool/" came up 
>> with only one hit on the torqueusers mailing list and there was no 
>> resolution.  Should I be sending my query there instead of here?
>>
>>               Thanks,
>>                           /Chris
>>     
>
>   
>> Date: Fri, 16 Jun 2006 12:10:49 -0400 (EDT)
>> From: adm at c2.cs.princeton.edu (root)
>> Subject: PBS JOB 43.c2.cs.princeton.edu
>> To: tengi at CS.Princeton.EDU
>>
>> PBS Job Id: 43.c2.cs.princeton.edu
>> Job Name:   mpitest2.pbs
>> An error has occurred processing your job, see below.
>> Post job file processing error; job 43.c2.cs.princeton.edu on host compute-0-11.local/1+compute-0-11.local/0+compute-0-10.local/1+compute-0-10.local/0
>>
>> Unable to copy file /opt/torque/spool/43.c2.cs.pr.OU to atengi@c2.cs.princeton.edu:/u/atengi/cluster/mpitest/mpitest2.pbs.o43
>>     
>>>>> error from copy
>>>>>           
>> /opt/torque/spool/43.c2.cs.pr.OU: No such file or directory
>>     
>>>>> end error output
>>>>>           
>
> I've done a lot of work in the data staging code, but that was back in
> 2.0.0p5.  I don't recall this particular problem before.  I'm guessing
> this is coming from a duplicated copy request.  Is the file actually
> showing up at the destination?  Are you seeing failed rcp/scp's in your
> syslog on c2.cs.princeton.edu?
>   

Having no idea how torque operates, I never thought to look for scp 
errors.  Looking in /var/log/daemon and /var/log/messages yields results 
like this:

Jun 27 14:18:35 compute-0-7.local pbs_mom: sys_copy, command '/usr/bin/scp -Brp /opt/torque/spool/76.c2.cs.pr.OU atengi@c2.cs.princeton.edu:/u/atengi/cluster/mpitest/mpitest3.pbs.o76' failed with status=1, giving up after 4 attempts 
Jun 27 14:18:39 compute-0-7.local pbs_mom: sys_copy, command '/usr/bin/scp -Brp /opt/torque/spool/76.c2.cs.pr.ER atengi@c2.cs.princeton.edu:/u/atengi/cluster/mpitest/mpitest3.pbs.e76' failed with status=1, giving up after 4 attempts 
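
Just to rule out the obvious (this is only a guess at a diagnostic on
my part), I suppose I could try the same sort of copy by hand from
compute-0-7 as my own user and see whether scp succeeds outside of
pbs_mom:

# run on compute-0-7 as the job owner (hypothetical test)
echo test > /tmp/scp-test
/usr/bin/scp -Brp /tmp/scp-test atengi@c2.cs.princeton.edu:/tmp/scp-test
echo $?    # non-zero here would point at ssh keys/host keys, not torque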


Of the 3 nodes assigned for this job, compute-0-7 is the only one whose 
spool directory had a timestamp today.  Note that if I run with just a 
single node and up to 4 processors on that node, I do not get any errors 
and the job completes.  It is only when I explicitly (or implicitly) 
try to use more than 1 node that I get these errors.  It is 
as though the files are copied back in a different fashion depending on 
whether you use 1 or more than 1 node.  Is one node considered a 
"control node?"  Might it be that copying information from non-control 
nodes to the control node blows away the file(s)?
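
One way I can think of to test that guess (purely hypothetical, while a
multi-node job is still running) would be to look at the spool
directory on every node the job was assigned and see where the .OU and
.ER files actually end up:

# run from the frontend while the job is active (hypothetical check)
for n in compute-0-7 compute-0-10 compute-0-11; do
    echo "== $n =="
    ssh $n ls -l /opt/torque/spool/
done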

> Increase the loglevel on the MOMs and look at the log file.
>
>   

It looks like the mom_priv/config file on the compute nodes contains:

$pbsserver c2.local
$usecp c2.cs.princeton.edu:/home /home


and nothing else.  When I add "$loglevel 7" on all of the compute nodes 
and restart pbs, I get nothing useful because the compute nodes no 
longer talk (on UDP port 15001) to the frontend.  It looks like *that* 
is a problem with the compute nodes using the frontend's public address 
rather than the internal address, so I suppose I'll have to address that 
as well.  However, before I went down this path, I saw the following in 
the mom_log file on compute-0-7 (from which I had gotten the error 
EMail, apparently):

06/27/2006 14:53:14;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at 172.17.24.2, sock=10
06/27/2006 14:53:14;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at 172.17.24.2, sock=10
06/27/2006 14:53:14;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at 172.17.24.2, sock=10
06/27/2006 14:53:14;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at 172.17.24.2, sock=10
06/27/2006 14:53:14;0001;   pbs_mom;Svr;pbs_mom;No such file or directory (2) in start_exec, rpp_open failed on compute-0-1.local
06/27/2006 14:53:14;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
06/27/2006 14:53:14;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at 172.17.24.2, sock=13
06/27/2006 14:53:14;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server at 172.17.24.2, sock=10
06/27/2006 14:53:14;0008;   pbs_mom;Job;77.c2.cs.princeton.edu;Job Modified at request of PBS_Server at 172.17.24.2
06/27/2006 14:53:14;0100;   pbs_mom;Req;;Type CopyFiles request received from PBS_Server at 172.17.24.2, sock=14


I'm guessing that the rpp_open failure is the problem, but since I don't really 
know how any of this works, I haven't yet figured out the answer.  For 
now, I'm going to investigate the IP address issue listed above to see 
if *that* helps me get anywhere.
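
Since the rpp_open failure names compute-0-1.local, my first check
(again, just a guess at a diagnostic) will be whether every compute
node can resolve the other nodes' internal names at all, along the
lines of:

# hypothetical sanity check, run from one compute node
for n in compute-0-1 compute-0-7 compute-0-10 compute-0-11; do
    getent hosts $n.local || echo "$n.local does not resolve"
done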

                /Chris
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>   


