[torqueusers] question about pbs with more than 12... processes

Glen Beane glen.beane at gmail.com
Tue Feb 24 08:14:50 MST 2009


Those are MPICH errors, not TORQUE errors.  It might be best to ask
that question on an MPICH mailing list.

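Also, if you are going to use MPICH 1.x with TORQUE, you should use
OSC's mpiexec (www.osc.edu/~pw/mpiexec) instead of mpirun.  It starts
the processes through TORQUE's task manager (TM) interface rather than
rsh, so the job runs on the nodes TORQUE allocated, and MPICH
integrates much better with TORQUE.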
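
Here is a minimal sketch of what the submit script might look like with
mpiexec, assuming OSC's mpiexec is installed and on the PATH of the
compute nodes (that location is an assumption, not something from the
original post). It takes the process count from $PBS_NODEFILE, which
TORQUE fills with one hostname per allocated slot, instead of
hard-coding it. The slot_count helper and the guard around the launch
are illustrative additions. Depending on how mpiexec was built, a -comm
option may also be needed to select the MPICH p4 device; its man page
is the place to check.

```shell
#!/bin/sh
#PBS -q verylong
#PBS -l nodes=2:ppn=6
#PBS -N PI
#PBS -j oe

# slot_count FILE: print the number of non-empty lines in FILE.
# $PBS_NODEFILE has one hostname per allocated slot, so a request of
# nodes=2:ppn=6 should give 12 lines. (Hypothetical helper name.)
slot_count() {
    grep -c . "$1"
}

# Launch only when running inside a TORQUE job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    # Assumption: OSC mpiexec is on PATH. -n sets the number of processes.
    mpiexec -n "$(slot_count "$PBS_NODEFILE")" ./cpi
fi
```

With this layout, changing ppn in the resource request changes the
process count automatically, so the two cannot drift apart.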



On Mon, Feb 23, 2009 at 10:11 PM, ccsad <ccsad at pku.edu.cn> wrote:
> Hi everyone,
> I am using TORQUE and MPICH 1.2.7 on Gigabit Ethernet. The following is my submit file:
> #PBS -q verylong
> #PBS -l nodes=2:ppn=5
> #PBS -N PI
> #PBS -j oe
> cat $PBS_NODEFILE |tee list
> cd $PBS_O_WORKDIR
> /opt/mpich1.2.7/bin/mpirun  -np  10 ./cpi
>
> It works, but whenever I change ppn=5 to ppn=6 or higher, I get this error message:
> poll: protocol failure in circuit setup
> p0_766:  p4_error: Child process exited while making connection to remote process on console: 0
> rm_24911: (0.132812) net_send: could not write to fd=4, errno = 32
> rm_24805: (0.308594) net_send: could not write to fd=4, errno = 32
> rm_l_9_24980: (0.136719) net_send: could not write to fd=5, errno = 32
> rm_l_2_24669: (0.562500) net_send: could not write to fd=5, errno = 32
> rm_l_4_24735: (0.484375) net_send: could not write to fd=5, errno = 32
> rm_24706: (0.484375) net_send: could not write to fd=4, errno = 32
> rm_24640: (0.562500) net_send: could not write to fd=4, errno = 32
> rm_24838: (0.226562) net_send: could not write to fd=4, errno = 32
> rm_l_8_24867: (0.226562) net_send: could not write to fd=5, errno = 32
> rm_l_7_24834: (0.312500) net_send: could not write to fd=5, errno = 32
> rm_24607: (0.605469) net_send: could not write to fd=4, errno = 32
> rm_l_1_24636: (0.609375) net_send: could not write to fd=5, errno = 32
> rm_24673: (0.523438) net_send: could not write to fd=4, errno = 32
> rm_l_3_24702: (0.523438) net_send: could not write to fd=5, errno = 32
> rm_24772: (0.402344) net_send: could not write to fd=4, errno = 32
> rm_24739: (0.441406) net_send: could not write to fd=4, errno = 32
> rm_l_5_24768: (0.441406) net_send: could not write to fd=5, errno = 32
> rm_l_10_25013: (0.011719) net_send: could not write to fd=5, errno = 32
> rm_24984: (0.011719) net_send: could not write to fd=4, errno = 32
> rm_l_6_24801: (0.402344) net_send: could not write to fd=5, errno = 32
> p0_766: (28.667969) net_send: could not write to fd=4, errno = 32
>
> ==
> But if I submit with:
> /opt/mpich1.2.7/bin/mpirun -machinefile /home/skovira/test_pi_ge_mpich1/host48 -np 48  ./cpi
> it works very well, so I wonder whether I can do something about this with PBS qmgr. Who can help me?
> Best regards!!
> skovira