[Mauiusers] Problem with Torque/Maui

S Ranjan sranjan at ipr.res.in
Wed Jan 24 17:07:56 MST 2007


Garrick Staples wrote:

>On Wed, Jan 24, 2007 at 08:51:11AM +0530, S Ranjan alleged:
>  
>
>>Hi
>>
>>I have torque pbs_server running on the headnode, which is also the 
>>submit host.  There are 32 other compute nodes, mentioned in 
>>/var/spool/torque/server_priv/nodes file.  There is a single queue at 
>>present.  Sometimes MPI jobs requesting 28/30 nodes end up
>>running on the head node, though the head node is not a compute node at
>>all.  netstat -anp shows several sockets being opened for the job, and
>>eventually the head node hangs.
>>
>>Appreciate any help/suggestion on this.
>
>Which MPI?  MPICH?  I'd guess mpirun is using the default machinefile
>that is created when mpich is built, and not the hostlist provided by
>the PBS job.
>
>Run mpirun with "-machinefile $PBS_NODEFILE" or use OSC's mpiexec
>instead of mpirun: http://www.osc.edu/~pw/mpiexec/
>

We are using Intel MPI 2.0.  We are using mpiexec -n 28 ......
inside the PBS script.
However, we run mpdboot (an executable in the MPI 2.0 binary directory)
before running the PBS script. The exact syntax used is:

mpdboot -n 32 -f mpd.hosts --rsh=ssh -v

The mpd.hosts file, which resides in the user's home directory, contains
the names of the 32 compute nodes (the head node is excluded).
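
One way to check whether the hosts Torque hands to a job match the hosts in
the mpd ring is sketched below. It is only a sketch under assumptions: the
function name hosts_missing_from is hypothetical, the comparison is a plain
string match (so short names and fully qualified names would not match each
other), and it assumes the check is run from inside the PBS script, where
Torque sets $PBS_NODEFILE.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or Intel MPI):
# print every host that appears in NODEFILE but not in RINGFILE.
# Usage: hosts_missing_from NODEFILE RINGFILE
hosts_missing_from() {
    nodefile=$1
    ringfile=$2
    # The PBS node file can repeat a hostname once per allocated processor,
    # so de-duplicate before comparing.
    sort -u "$nodefile" | while read -r host; do
        # -x matches the whole line, -F treats the name as a literal string.
        grep -qxF "$host" "$ringfile" || echo "$host"
    done
}

# Only runs inside a PBS job; assumes mpd.hosts is in the home directory.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$HOME/mpd.hosts" ]; then
    hosts_missing_from "$PBS_NODEFILE" "$HOME/mpd.hosts"
fi
```

Any hostname this prints was assigned to the job by Torque but is absent from
mpd.hosts; no output means every assigned host is listed there.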


Sutapa Ranjan


