[torqueusers] Timeout using mpiexec and torque

Garrick Staples garrick at clusterresources.com
Wed Nov 1 14:08:39 MST 2006


On Tue, Oct 31, 2006 at 02:53:31PM +0000, Eugene van den Hurk alleged:
> Hello,
> 
> I have seen errors of the following nature coming
> up in some of my tests since using mpiexec to run mpi programs:
> 
> #####################################################################
> p17_29269:  p4_error: Timeout in establishing connection to remote process: 
> 0
> rm_l_17_29270: (302.817528) net_send: could not write to fd=5, errno = 32
> p33_29615:  p4_error: interrupt SIGx: 13
> p17_29269: (367.028649) net_send: could not write to fd=5, errno = 32
> p33_29615: (431.239899) net_send: could not write to fd=5, errno = 32
> p1_5045:  p4_error: interrupt SIGx: 15
> rm_l_1_5046: (916.888204) net_send: could not write to fd=5, errno = 32
> mpiexec: Warning: task 1 exited oddly---report bug: status 0 done 0.
> mpiexec: Warning: tasks 17,33 exited with status 1.
> #####################################################################
> 
> I am using the following:
> mpiexec 0.81
> mpich 1.2.7p1.
> Torque 2.1.6
> Maui 3.2.6p16
> 
> These errors happen randomly.
> I have been trying to narrow it down to see if it is a problem with
> specific nodes on the cluster.
> But as I say it has been happening randomly and I haven't been able
> to track down anything conclusive.
> 
> Just posting this to the list in case anybody else has had similar
> issues in the past and might be able to shed some light on this.

Looks like a network problem to me.  Does it work consistently with
mpich's mpirun?

The difference between mpiexec and mpirun is that mpiexec uses TM to
execute the remote processes, while mpirun uses rsh/ssh to execute
them.  Once the processes are running, they basically do the same
thing (which is completely outside of TORQUE).
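To illustrate the two launch paths, here is a hypothetical PBS job-script fragment (not from the original post; the executable name ./a.out and node counts are made up):

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2

# mpirun (MPICH ch_p4): starts the remote processes itself over
# rsh/ssh, so they are not children of pbs_mom and TORQUE cannot
# track or clean them up.
mpirun -np 4 -machinefile $PBS_NODEFILE ./a.out

# mpiexec (OSC): asks TORQUE's TM interface to spawn the processes,
# so each one is started by the pbs_mom on its node.
mpiexec ./a.out
```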

Test the TM interface with something simple like:
  for a in $(seq 1 100); do
     pbsdsh hostname
  done

Then test mpiexec's TM implementation with something similar (note
that mpiexec will be a lot slower):
  for a in $(seq 1 100); do
     mpiexec -nostdout -nostdin --comm=none hostname
  done

If the TM part is working OK, then you know it is some sort of network
problem outside of TORQUE.
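Since the failures are random, it may help to record which iterations of the pbsdsh test fail so they can be correlated with the node list afterwards. A minimal sketch (not from the original post; the log filename tm_failures.log is made up, and pbsdsh is assumed to be on PATH inside a TORQUE job):

```shell
# Repeat the TM test, logging each failing iteration instead of
# stopping; compare the log against $PBS_NODEFILE afterwards to see
# whether failures cluster on particular nodes.
: > tm_failures.log
for a in $(seq 1 100); do
    if ! pbsdsh hostname > /dev/null 2>&1; then
        echo "iteration $a failed" >> tm_failures.log
    fi
done
echo "$(wc -l < tm_failures.log) of 100 iterations failed"
```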
