[torqueusers] help me about torque+mpich with more than 12 processes

ccsad ccsad at pku.edu.cn
Mon Feb 23 20:16:10 MST 2009


Hi everyone,
I am using TORQUE and MPICH 1.2.7 over Gigabit Ethernet. The following is my submit file:
#PBS -q verylong
#PBS -l nodes=2:ppn=5
#PBS -N PI
#PBS -j oe
cat $PBS_NODEFILE |tee list
cd $PBS_O_WORKDIR
/opt/mpich1.2.7/bin/mpirun  -np  10 ./cpi
 
This works, but whenever I change ppn=5 to ppn=6 or higher, I get the following error messages:
poll: protocol failure in circuit setup
p0_766:  p4_error: Child process exited while making connection to remote process on console: 0
rm_24911: (0.132812) net_send: could not write to fd=4, errno = 32
rm_24805: (0.308594) net_send: could not write to fd=4, errno = 32
rm_l_9_24980: (0.136719) net_send: could not write to fd=5, errno = 32
rm_l_2_24669: (0.562500) net_send: could not write to fd=5, errno = 32
rm_l_4_24735: (0.484375) net_send: could not write to fd=5, errno = 32
rm_24706: (0.484375) net_send: could not write to fd=4, errno = 32
rm_24640: (0.562500) net_send: could not write to fd=4, errno = 32
rm_24838: (0.226562) net_send: could not write to fd=4, errno = 32
rm_l_8_24867: (0.226562) net_send: could not write to fd=5, errno = 32
rm_l_7_24834: (0.312500) net_send: could not write to fd=5, errno = 32
rm_24607: (0.605469) net_send: could not write to fd=4, errno = 32
rm_l_1_24636: (0.609375) net_send: could not write to fd=5, errno = 32
rm_24673: (0.523438) net_send: could not write to fd=4, errno = 32
rm_l_3_24702: (0.523438) net_send: could not write to fd=5, errno = 32
rm_24772: (0.402344) net_send: could not write to fd=4, errno = 32
rm_24739: (0.441406) net_send: could not write to fd=4, errno = 32
rm_l_5_24768: (0.441406) net_send: could not write to fd=5, errno = 32
rm_l_10_25013: (0.011719) net_send: could not write to fd=5, errno = 32
rm_24984: (0.011719) net_send: could not write to fd=4, errno = 32
rm_l_6_24801: (0.402344) net_send: could not write to fd=5, errno = 32
p0_766: (28.667969) net_send: could not write to fd=4, errno = 32

==
But if I launch it with an explicit machine file:
/opt/mpich1.2.7/bin/mpirun -machinefile /home/skovira/test_pi_ge_mpich1/host48 -np 48  ./cpi
it works very well, so I wonder whether something needs to be changed in the PBS configuration through qmgr. Can anyone help?
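
For comparison, here is a sketch of a submit file that hands the TORQUE node list to mpirun explicitly rather than leaving mpirun to choose its hosts. It is untested and not a confirmed fix. It assumes that $PBS_NODEFILE contains one line per allocated slot and that this mpirun accepts -machinefile as in the manual run above. The names count_slots and NP are made up for illustration.

```shell
#!/bin/sh
#PBS -q verylong
#PBS -l nodes=2:ppn=6
#PBS -N PI
#PBS -j oe

# Hypothetical helper: count the non-empty lines in a node file.
# Assumption: TORQUE writes one line per allocated slot.
count_slots() {
    grep -c . "$1"
}

# Only launch when running inside a TORQUE job (PBS_NODEFILE is set).
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    NP=$(count_slots "$PBS_NODEFILE")
    # Pass the allocated hosts explicitly and match -np to the slot count.
    /opt/mpich1.2.7/bin/mpirun -machinefile "$PBS_NODEFILE" -np "$NP" ./cpi
fi
```

With nodes=2:ppn=6 the node file would be expected to have 12 lines, so NP would be 12 and no hard-coded -np value has to be kept in sync with the resource request.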
Best regards,
skovira