[torqueusers] specific nodes
lloyd_brown at byu.edu
Wed Nov 30 15:51:43 MST 2011
On 11/30/2011 03:38 PM, Ricardo Román Brenes wrote:
> 1. Why does there have to be a match between processors and processes? I
> could run 1024 processes on 1 processor (without Torque). Requesting 2
> nodes, I could spawn 10000 processes...
I suspect it was just a general recommendation. You're right. Nothing
is keeping you from launching more processes than you have processors.
Having said that, though, it's generally a bad idea. Unless your
processes spend a significant amount of time idle or blocked (e.g.
doing I/O), you will see significant slowdowns. Also, if another
job is on the same node, the two jobs' processes will compete for the
same cores, and both will slow down.
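
As a rough illustration of the point above, here is a minimal sketch that
warns when a requested process count exceeds the cores on the local machine.
The function name "oversubscribed" and the NP variable are made up for this
example; they are not part of Torque or any MPI.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the requested process
# count ($1) is larger than the number of available cores ($2).
oversubscribed() {
    [ "$1" -gt "$2" ]
}

# Number of online cores on this machine; fall back to 1 if getconf fails.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# NP is an assumed variable holding the process count you intend to launch.
requested=${NP:-1024}

if oversubscribed "$requested" "$cores"; then
    echo "warning: $requested processes requested, but only $cores cores here"
fi
```

This only looks at the local machine. Inside a job, the count that matters
is the number of slots the scheduler actually assigned.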
> mpirun -hostfile $PBS_NODEFILE -np 2 ./a.out
> 3. My MPICH2 is version 1.2.1p1. I don't recall if I compiled it with
> Torque support. Even so, I don't have a variable $PBS_NODEFILE. (Doing an
> "echo $PBS_NODEFILE" returns an empty line.)
The $PBS_NODEFILE variable is only populated within a running job's
environment. It contains a path to a file that lists the nodes that
your job was assigned. So, inside of a job "echo $PBS_NODEFILE" should
give you the path to that temporary file. And "cat $PBS_NODEFILE" will
give you the contents.
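
For what it's worth, here is a minimal sketch of how a job script might use
that file. It assumes the usual Torque behaviour of writing one line per
assigned processor slot (so a node name repeats once per ppn); the function
name "count_slots" is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: count the non-empty lines in a node file.
# Assumes one line per assigned processor slot.
count_slots() {
    grep -c . "$1"
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    np=$(count_slots "$PBS_NODEFILE")
    nodes=$(sort -u "$PBS_NODEFILE" | grep -c .)
    echo "Assigned $np slots on $nodes distinct nodes"
    # The launch line would then look something like this (left commented
    # out, since the exact flags depend on your MPI):
    # mpirun -hostfile "$PBS_NODEFILE" -np "$np" ./a.out
else
    echo "PBS_NODEFILE is not set; this shell is not inside a running job"
fi
```

Run from a login shell, this takes the "else" branch, which is the same
thing the empty "echo $PBS_NODEFILE" output is telling you.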
> 4. I don't know if this is my problem or not, but you talk about mpirun
> and mpiexec as if they were the same, yet I have used mpiexec most of
> the time and I'm not sure about the similarities (or differences). You
> asked if my MPIEXEC is built with Torque but a few lines below you
> mention MPIRUN.
In the early days of MPICH (e.g. MPICH1, not MPICH2), mpirun was provided
by MPICH, and mpiexec was something separate. I don't know whether the
same holds true with MPICH2; like Gustavo, I mostly use OpenMPI, where
they're both the same.
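
If you want to check your own installation, something like the following
sketch compares what the two commands resolve to. The function name
"same_target" is made up for this example, and it assumes a "readlink" that
supports -f (GNU coreutils does). Two different files could still behave the
same, so a mismatch here is only a hint.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when two command names resolve to the
# same file after following symlinks. Returns 2 if either is not found.
same_target() {
    a=$(command -v "$1") || return 2
    b=$(command -v "$2") || return 2
    [ "$(readlink -f "$a")" = "$(readlink -f "$b")" ]
}

rc=0
same_target mpirun mpiexec || rc=$?
case $rc in
    0) echo "mpirun and mpiexec resolve to the same file" ;;
    1) echo "mpirun and mpiexec resolve to different files" ;;
    *) echo "mpirun and/or mpiexec not found in PATH" ;;
esac
```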
Given what you've told us so far, you potentially have two separate
problems: The scheduler, and the MPI process launching. It might make
the most sense to focus on just the scheduler for the time being.
What happens if the entire body of your job script is just a "cat
$PBS_NODEFILE", something like this:
> #PBS -q uno
> #PBS -l nodes=2:ppn=2,walltime=00:00:30
> echo "Nodes Assigned:"
> cat $PBS_NODEFILE
> echo "done"