[torqueusers] Timeout using mpiexec and torque

Eugene van den Hurk e.vandenhurk at bcri.ucc.ie
Tue Oct 31 07:53:31 MST 2006


Hello,

Since I started using mpiexec to run MPI programs, I have been seeing
errors like the following in some of my tests:

#####################################################################
p17_29269:  p4_error: Timeout in establishing connection to remote process: 0
rm_l_17_29270: (302.817528) net_send: could not write to fd=5, errno = 32
p33_29615:  p4_error: interrupt SIGx: 13
p17_29269: (367.028649) net_send: could not write to fd=5, errno = 32
p33_29615: (431.239899) net_send: could not write to fd=5, errno = 32
p1_5045:  p4_error: interrupt SIGx: 15
rm_l_1_5046: (916.888204) net_send: could not write to fd=5, errno = 32
mpiexec: Warning: task 1 exited oddly---report bug: status 0 done 0.
mpiexec: Warning: tasks 17,33 exited with status 1.
#####################################################################

I am using the following:
mpiexec 0.81
MPICH 1.2.7p1
Torque 2.1.6
Maui 3.2.6p16

These errors occur intermittently, with no obvious pattern.
I have been trying to work out whether they are tied to specific
nodes on the cluster, but so far I haven't been able to track down
anything conclusive.
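
One rough way to look for a pattern is to tally how often each process
number shows up in the error output. The Python sketch below assumes that
the number in each "p<N>_<pid>:" or "rm_l_<N>_<pid>:" prefix is the process
rank; that is only my reading of the p4 output, not something confirmed.
The helper name tally_ranks is made up for illustration.

```python
import re
from collections import Counter

# Assumption: p4 prefixes look like "p<rank>_<pid>:" or "rm_l_<rank>_<pid>:".
PREFIX = re.compile(r"^(?:rm_l_|p)(\d+)_(\d+):")


def tally_ranks(lines):
    """Count error lines per rank, taken from the p4 line prefix."""
    counts = Counter()
    for line in lines:
        match = PREFIX.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# The error output quoted in this post, one entry per line.
sample = [
    "p17_29269:  p4_error: Timeout in establishing connection to remote process: 0",
    "rm_l_17_29270: (302.817528) net_send: could not write to fd=5, errno = 32",
    "p33_29615:  p4_error: interrupt SIGx: 13",
    "p17_29269: (367.028649) net_send: could not write to fd=5, errno = 32",
    "p33_29615: (431.239899) net_send: could not write to fd=5, errno = 32",
    "p1_5045:  p4_error: interrupt SIGx: 15",
    "rm_l_1_5046: (916.888204) net_send: could not write to fd=5, errno = 32",
    "mpiexec: Warning: task 1 exited oddly---report bug: status 0 done 0.",
]

counts = tally_ranks(sample)
for rank, n in sorted(counts.items()):
    print(f"rank {rank}: {n} error line(s)")
```

Run over the output of many failed jobs, this would show whether the same
ranks keep failing. Mapping ranks back to hostnames would still need each
job's node list, for example the file named by $PBS_NODEFILE.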

I am posting this to the list in case anybody else has had similar
issues in the past and might be able to shed some light on this.


Thanks,
Regards,
Eugene.


