[Mauiusers] Problem with Torque/Maui
garrick at usc.edu
Wed Jan 24 23:40:16 MST 2007
On Thu, Jan 25, 2007 at 05:37:56AM +0530, S Ranjan alleged:
> Garrick Staples wrote:
> >On Wed, Jan 24, 2007 at 08:51:11AM +0530, S Ranjan alleged:
> >>I have torque pbs_server running on the headnode, which is also the
> >>submit host. There are 32 other compute nodes, mentioned in
> >>/var/spool/torque/server_priv/nodes file. There is a single queue at
> >>present. Sometimes MPI jobs requesting 28/30 nodes end up
> >>running on the head node, though the head node is not a compute node at
> >>all. netstat -anp shows several sockets being opened for the job, and
> >>eventually the head node hangs.
> >>Appreciate any help/suggestion on this.
> >Which MPI? MPICH? I'd guess mpirun is using the default machinefile
> >that is created when mpich is built, and not the hostlist provided by
> >the PBS job.
> >Run mpirun with "-machinefile $PBS_NODEFILE" or use OSC's mpiexec
> >instead of mpirun: http://www.osc.edu/~pw/mpiexec/
> We are using Intel MPI 2.0. We are using mpiexec -n 28 ......
> inside the PBS script.
> However, we run mpdboot (an executable in the MPI 2.0 binary dir)
> before running the PBS script. The exact syntax being used is
> mpdboot -n 32 -f mpd.hosts --rsh=ssh -v
> mpd.hosts file, residing in the user's home directory, contains the
> names of the 32 compute nodes (excluding the head node).
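There is your problem: you want to use the list of nodes assigned to
your job, not a static mpd.hosts file. $PBS_NODEFILE is only set inside
the job, so mpdboot has to run from the PBS script itself. You'll want
something like this: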
np=$(wc -l < $PBS_NODEFILE)
mpdboot -n $np -f $PBS_NODEFILE --rsh=ssh -v
But I still recommend using OSC's mpiexec instead.
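If you do stay with mpdboot, here is a slightly fuller sketch of the same idea for the job script. It rests on one assumption I have not checked against Intel MPI 2.0: that mpdboot wants one entry per host. Torque writes a host into $PBS_NODEFILE once per allocated processor, so a job with ppn > 1 gets repeated lines, and the sketch deduplicates them first. The helper names unique_hosts and boot_ring and the scratch file are made up for illustration; the mpdboot flags are the ones already used above.

```shell
#!/bin/sh
# Sketch of a Torque job script fragment; helper names are hypothetical.

# unique_hosts FILE: print each host listed in FILE exactly once.
# Torque repeats a host once per allocated processor (ppn > 1).
unique_hosts() {
    sort -u "$1"
}

# boot_ring NODEFILE: start one mpd per unique host assigned to the job.
# Assumption: mpdboot expects one line per host in the file given to -f.
boot_ring() {
    hosts=$(mktemp) || return 1          # scratch copy of the host list
    unique_hosts "$1" > "$hosts"
    nhosts=$(wc -l < "$hosts" | tr -d ' ')
    mpdboot -n "$nhosts" -f "$hosts" --rsh=ssh -v
    rc=$?
    rm -f "$hosts"
    return $rc
}

# Only act inside a PBS job, and only if mpdboot is on the PATH.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpdboot >/dev/null 2>&1; then
    boot_ring "$PBS_NODEFILE"
fi
```

The mpiexec -n 28 ...... line from your script would then follow in the same job, so the ring only ever spans the nodes the scheduler handed out.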
Garrick Staples, Linux/HPCC Administrator
University of Southern California