[torqueusers] question about pbs with more than 12... processes

Gus Correa gus at ldeo.columbia.edu
Tue Feb 24 09:04:04 MST 2009


Hi Ccsad (?), list

The same issue has appeared many times here and on other lists,
with the same "p4" errors that appear and disappear depending
on the number of processes, etc.

As Glen said, this is not a Torque problem but an MPICH one,
and the right mpiexec to use with Torque is the
one from the Ohio Supercomputer Center (see the link Glen sent).
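
For reference, a minimal submit script using OSC's mpiexec could look
like the sketch below (the /opt/mpiexec path is only a placeholder for
wherever it is installed on your cluster, and mpiexec must have been
built for MPICH-1's p4 device):

#PBS -q verylong
#PBS -l nodes=2:ppn=6
#PBS -N PI
#PBS -j oe
cd $PBS_O_WORKDIR
# OSC's mpiexec launches the ranks through Torque's TM interface,
# so no machinefile or rsh/ssh startup is needed; by default it
# starts one process per slot that Torque allocated.
/opt/mpiexec/bin/mpiexec ./cpi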

You should subscribe to and send these questions to the MPICH list,
where you'll get the right advice from the right group of users
and developers:

http://www.mcs.anl.gov/research/projects/mpich2/support/index.php?s=support

MPICH-1 is old, not maintained, and doesn't seem to work
with newer Linux kernels.
The easy fix is to upgrade to MPICH-2 with the Nemesis
communication channel:

http://www.mcs.anl.gov/research/projects/mpich2/
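
If you go that route, the build itself is short; a rough sketch (the
install prefix is just an example path):

# unpack the MPICH2 tarball, then from inside the source tree:
./configure --prefix=/opt/mpich2 --with-device=ch3:nemesis
make
make install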

Here are a couple of recent threads about your problem:

http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
http://www.supercluster.org/pipermail/torqueusers/2009-February/008736.html

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Glen Beane wrote:
> those are mpich errors, not torque errors.  It might be best to ask
> that question on an mpich mailing list.
> 
> Also, if you are going to use mpich 1.x with TORQUE you should use
> OSC's mpiexec (www.osc.edu/~pw/mpiexec) instead of mpirun.  It makes
> mpich integrate much better with TORQUE.
> 
> 
> 
> On Mon, Feb 23, 2009 at 10:11 PM, ccsad <ccsad at pku.edu.cn> wrote:
>> Hi everyone,
>> I am using Torque and MPICH 1.2.7 on Gigabit Ethernet; the following is my submit file:
>> #PBS -q verylong
>> #PBS -l nodes=2:ppn=5
>> #PBS -N PI
>> #PBS -j oe
>> cat $PBS_NODEFILE |tee list
>> cd $PBS_O_WORKDIR
>> /opt/mpich1.2.7/bin/mpirun  -np  10 ./cpi
>>
>> It works, but whenever I change ppn=5 to ppn=6 or higher, I get this error message:
>> poll: protocol failure in circuit setup
>> p0_766:  p4_error: Child process exited while making connection to remote process on console: 0
>> rm_24911: (0.132812) net_send: could not write to fd=4, errno = 32
>> rm_24805: (0.308594) net_send: could not write to fd=4, errno = 32
>> rm_l_9_24980: (0.136719) net_send: could not write to fd=5, errno = 32
>> rm_l_2_24669: (0.562500) net_send: could not write to fd=5, errno = 32
>> rm_l_4_24735: (0.484375) net_send: could not write to fd=5, errno = 32
>> rm_24706: (0.484375) net_send: could not write to fd=4, errno = 32
>> rm_24640: (0.562500) net_send: could not write to fd=4, errno = 32
>> rm_24838: (0.226562) net_send: could not write to fd=4, errno = 32
>> rm_l_8_24867: (0.226562) net_send: could not write to fd=5, errno = 32
>> rm_l_7_24834: (0.312500) net_send: could not write to fd=5, errno = 32
>> rm_24607: (0.605469) net_send: could not write to fd=4, errno = 32
>> rm_l_1_24636: (0.609375) net_send: could not write to fd=5, errno = 32
>> rm_24673: (0.523438) net_send: could not write to fd=4, errno = 32
>> rm_l_3_24702: (0.523438) net_send: could not write to fd=5, errno = 32
>> rm_24772: (0.402344) net_send: could not write to fd=4, errno = 32
>> rm_24739: (0.441406) net_send: could not write to fd=4, errno = 32
>> rm_l_5_24768: (0.441406) net_send: could not write to fd=5, errno = 32
>> rm_l_10_25013: (0.011719) net_send: could not write to fd=5, errno = 32
>> rm_24984: (0.011719) net_send: could not write to fd=4, errno = 32
>> rm_l_6_24801: (0.402344) net_send: could not write to fd=5, errno = 32
>> p0_766: (28.667969) net_send: could not write to fd=4, errno = 32
>>
>> ==
>> but if I submit with:
>> /opt/mpich1.2.7/bin/mpirun -machinefile /home/skovira/test_pi_ge_mpich1/host48 -np 48  ./cpi
>> it works very well, so I wonder if I can do something about this in pbs qmgr. Who can help me?
>> Best regards!!
>> skovira
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


