[torqueusers] Torque cpusets messing up
sw77 at nyu.edu
Thu Mar 17 11:02:25 MDT 2011
Which MPI compiler did you use to build the MPI code, and have you enabled the CPU map for mpiexec?
In my experience with MVAPICH and mpiexec, on nodes with 8 CPU cores each: if I submit 2 jobs of 4 cores each to the same node, I have to pass the CPU-map flag
to mpiexec, otherwise both jobs run on the first 4 CPU cores.
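For reference, with MVAPICH2 the placement is usually controlled through environment variables rather than an mpiexec flag. A minimal sketch, assuming MVAPICH2 (the MV2_* variable names come from its documentation and may differ for other MPI stacks or for MVAPICH1):

```shell
# First 4-core job on the node: pin its ranks to cores 0-3.
export MV2_ENABLE_AFFINITY=1     # let MVAPICH2 bind each rank to a core
export MV2_CPU_MAPPING=0:1:2:3
echo "job 1 mapping: $MV2_CPU_MAPPING"

# The second 4-core job on the same node would use the other half,
# so the two jobs no longer pile up on cores 0-3:
export MV2_CPU_MAPPING=4:5:6:7
echo "job 2 mapping: $MV2_CPU_MAPPING"
```

Without an explicit mapping (or with affinity disabled), both jobs' ranks can end up bound to the same low-numbered cores, which matches the 50%-CPU symptom described below.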
On Mar 17, 2011, at 12:49 PM, R. David wrote:
> We had a long mail discussion a few weeks ago about MPI processes not correctly using Torque Cpusets.
> I still have the problem here.
> Here is what I observed today:
> - Torque 2.5.4, Centos 5.3
> - 8-core node, 1 core busy with a very long job (Gaussian, running for 193 hours). This job has its own cpuset, of course, containing one core (core #3)
> - I submit a job on the 7 available cores (qsub -l nodes=nodename:ppn=7). I get a 7-core cpuset: 0-2,4-7
> - I start the MPI job. 5 of the 7 MPI processes each get a core and run at 100% CPU.
> - The other 2 seem to share a core; they don't go above 50% CPU.
> - I suspend the long single-core job (qsig -s suspend); the MPI processes spread over 7 cores, and each of the 7 processes gets 100% CPU.
> - Resuming the long single-core job (qsig -s resume), it lands on the one remaining free core and rises back to 100% CPU.
> - Stopping and restarting the 7 MPI processes => each of them gets 100% CPU.
> I don't understand why I had to suspend and resume the single-core job before each of the 8 processes running on this node could finally get 100% of a core.
> Do you have any clue about this?
> R. David
> torqueusers mailing list
> torqueusers at supercluster.org
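The observations in the quoted report can be checked directly from a shell on the node: the job's cpuset string (e.g. 0-2,4-7) tells you how many cores the job may use, and each MPI rank's effective CPU mask shows whether two ranks ended up sharing a core. A minimal sketch, assuming a Linux node where /proc/<pid>/status exposes Cpus_allowed_list (count_cores and affinity_of are hypothetical helper names, not Torque commands):

```shell
# Count the cores in a cpuset list string such as "0-2,4-7"
# (the same format Torque writes into its per-job cpuset).
count_cores() {
  echo "$1" | tr ',' '\n' | while IFS=- read lo hi; do
    echo $(( ${hi:-$lo} - lo + 1 ))
  done | awk '{s += $1} END {print s}'
}

# Read a process's effective CPU mask from /proc; if two of the
# seven ranks report the same single core, they are the 50% pair.
affinity_of() {
  awk '/^Cpus_allowed_list/ {print $2}' "/proc/$1/status"
}

count_cores "0-2,4-7"   # the 7-core cpuset from the report → 7
affinity_of $$          # mask of the current shell, as a stand-in for a rank
```

Running affinity_of against each MPI rank's PID before and after the suspend/resume would show whether the ranks were actually rebound, or whether the MPI library pinned them itself without consulting the job's cpuset.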