[Mauiusers] Problem with Torque/Maui

S Ranjan sranjan at ipr.res.in
Sun Jan 28 17:29:47 MST 2007


Garrick Staples wrote:

>On Thu, Jan 25, 2007 at 05:37:56AM +0530, S Ranjan alleged:
>>Garrick Staples wrote:
>>
>>>On Wed, Jan 24, 2007 at 08:51:11AM +0530, S Ranjan alleged:
>>>
>>>>Hi
>>>>
>>>>I have the Torque pbs_server running on the head node, which is also the 
>>>>submit host.  There are 32 other compute nodes, listed in the 
>>>>/var/spool/torque/server_priv/nodes file.  There is a single queue at 
>>>>present.  Sometimes, MPI jobs requesting 28-30 nodes end up 
>>>>running on the head node, even though the head node is not a compute node at 
>>>>all.  netstat -anp shows several sockets being opened for the job, and 
>>>>eventually the head node hangs. 
>>>>
>>>>Appreciate any help/suggestion on this.
>>>>
>>>Which MPI?  MPICH?  I'd guess mpirun is using the default machinefile
>>>that is created when mpich is built, and not the hostlist provided by
>>>the PBS job.
>>>
>>>Run mpirun with "-machinefile $PBS_NODEFILE" or use OSC's mpiexec
>>>instead of mpirun: http://www.osc.edu/~pw/mpiexec/
>>>
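As a minimal sketch, that mpirun suggestion would look something like the line below inside a PBS job script; ./my_app stands in for the real program and is not taken from the thread:

# Use the node list that Torque generated for this particular job.
mpirun -np $(wc -l < $PBS_NODEFILE) -machinefile $PBS_NODEFILE ./my_app
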
>>We are using Intel MPI 2.0, and we run mpiexec -n 28 ......
>>inside the PBS script.
>>However, mpdboot (an executable in the Intel MPI 2.0 bin directory) is
>>run before submitting the PBS script.  The exact syntax being used is
>>
>>mpdboot -n 32 -f mpd.hosts --rsh=ssh -v
>>
>>The mpd.hosts file, residing in the user's home directory, contains the 
>>names of the 32 compute nodes (excluding the head node).
>
> There is your problem: you want to use the list of nodes assigned to
> your job.  So you'll want something like this:
> np=$(wc -l < $PBS_NODEFILE)
> mpdboot -n $np -f $PBS_NODEFILE --rsh=ssh -v
>
> But I still recommend using OSC's mpiexec instead.
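
For what it's worth, a complete PBS script along those lines might look roughly like the sketch below; the resource request, the executable name ./my_app, and the mpdallexit cleanup are illustrative assumptions, not taken from the thread:

#!/bin/bash
#PBS -l nodes=28
#PBS -N intelmpi_job

cd $PBS_O_WORKDIR

# Boot one mpd per entry in the node file Torque assigned to this job,
# instead of using the static mpd.hosts list in the home directory.
np=$(wc -l < $PBS_NODEFILE)
mpdboot -n $np -f $PBS_NODEFILE --rsh=ssh -v

# Run the MPI program across the nodes of this job.
mpiexec -n $np ./my_app

# Shut the mpd ring down when the run is finished.
mpdallexit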

Hi

Using OSC's mpiexec, the MPI job starts and then gives the following 
errors when run with the --comm=pmi option (and without any mpdboot).  When 
run without --comm=pmi, the job just aborts, complaining that it cannot 
connect to mpd2.console (the same error that is generated if an MPI job 
is launched without starting mpdboot).
I am using Intel MPI 2.0.

Thanks in advance for any help/suggestions

Sutapa

aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): 
MPIC_Sendrecv(152): 
MPIC_Wait(321): 
MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(1022): [ch3:sock] failed to connnect to remote process 339.clustserver-spawn-0:2
MPIDU_Socki_handle_connect(780): connection failure (set=0,sock=1,errno=111:Connection refused)
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): 
MPIC_Sendrecv(152): 
MPIC_Wait(321): 
MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(1022): [ch3:sock] failed to connnect to remote process 339.clustserver-spawn-0:3
MPIDU_Socki_handle_connect(780): connection failure (set=0,sock=1,errno=111:Connection refused)
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): 
MPIC_Sendrecv(152): 
MPIC_Wait(321): 
MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(461): 
connection_recv_fail(1685): 
MPIDU_Socki_handle_read(627): connection failure (set=0,sock=1,errno=104:Connection reset by peer)
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): 
MPIC_Sendrecv(152): 
MPIC_Wait(321): 
MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(1022): [ch3:sock] failed to connnect to remote process 339.clustserver-spawn-0:1
MPIDU_Socki_handle_connect(780): connection failure (set=0,sock=1,errno=111:Connection refused)
newmpiexec: Warning: tasks 0-3 exited with status 13.
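
For reference, the failing launch inside the PBS script is essentially the following; the process count and the executable name ./my_app are placeholders, since the exact command line is not shown above:

# OSC's mpiexec spawns the tasks itself and speaks PMI to Intel MPI,
# so no mpd ring is started beforehand.
mpiexec --comm=pmi -n 28 ./my_app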




