[torqueusers] Timeout using mpiexec and torque
Garrick Staples
garrick at clusterresources.com
Wed Nov 1 14:08:39 MST 2006
On Tue, Oct 31, 2006 at 02:53:31PM +0000, Eugene van den Hurk alleged:
> Hello,
>
> I have seen errors of the following nature coming
> up in some of my tests since using mpiexec to run mpi programs:
>
> #####################################################################
> p17_29269: p4_error: Timeout in establishing connection to remote process: 0
> rm_l_17_29270: (302.817528) net_send: could not write to fd=5, errno = 32
> p33_29615: p4_error: interrupt SIGx: 13
> p17_29269: (367.028649) net_send: could not write to fd=5, errno = 32
> p33_29615: (431.239899) net_send: could not write to fd=5, errno = 32
> p1_5045: p4_error: interrupt SIGx: 15
> rm_l_1_5046: (916.888204) net_send: could not write to fd=5, errno = 32
> mpiexec: Warning: task 1 exited oddly---report bug: status 0 done 0.
> mpiexec: Warning: tasks 17,33 exited with status 1.
> #####################################################################
>
> I am using the following:
> mpiexec 0.81
> mpich 1.2.7p1.
> Torque 2.1.6
> Maui 3.2.6p16
>
> These errors happen randomly.
> I have been trying to narrow it down to see if it is a problem with
> specific nodes on the cluster.
> But as I say it has been happening randomly and I haven't been able
> to track down anything conclusive.
>
> Just posting this to the list in case anybody else has had similar
> issues in the past and might be able to shed some light on this.
Looks like a network problem to me. Does it work consistently with
mpich's mpirun?

The difference between mpiexec and mpirun is that mpiexec uses TM to
execute the processes, and mpirun uses rsh/ssh to execute the remote
processes. Once they are executed, they basically do the same thing
(which is completely outside of TORQUE).
Test the TM interface with something simple like:

  for a in $(seq 1 100); do
      pbsdsh hostname
  done
Then test mpiexec's TM implementation with something similar (note that
mpiexec will be a lot slower):

  for a in $(seq 1 100); do
      mpiexec -nostdout -nostdin --comm=none hostname
  done
If the TM part is working OK, then you know it is some sort of network
problem outside of TORQUE.
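
Since the failures are random, it may help to count failed iterations
rather than watch the output scroll by. The following is only a rough
sketch: run_repeated is a hypothetical helper name (not part of TORQUE
or mpiexec), and it assumes the command being tested reports failure
through a non-zero exit status, which is worth verifying for your
versions of pbsdsh and mpiexec.

```shell
#!/bin/sh
# run_repeated COUNT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND COUNT times, discards its stdout,
# and prints how many iterations exited with a non-zero status.
# Returns 0 only if every iteration succeeded.
run_repeated() {
    count=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # Assumption: a failed launch shows up as a non-zero exit status.
        if ! "$@" >/dev/null; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
    [ "$failures" -eq 0 ]
}
```

Inside a job this could be called as "run_repeated 100 pbsdsh hostname"
and then as "run_repeated 100 mpiexec -nostdout -nostdin --comm=none
hostname". A non-zero count from the first would point at TM; zero from
both would point back at the network.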