[torqueusers] Torque cpusets messing up
david at unistra.fr
Thu Mar 17 10:49:50 MDT 2011
We had a long mail discussion a few weeks ago about MPI processes not correctly using Torque Cpusets.
I still have the problem here.
Here is what I observed today:
- Torque 2.5.4, Centos 5.3
- 8-core node, 1 core busy with a very long job (gaussian, running for 193 hours). This job has its own cpuset, of course, containing one core (core #3)
- I submit a job on the 7 available cores (qsub -l nodes=nodename:ppn=7). I get a 7-core cpuset: 0-2,4-7
- I start the MPI job. 5 of the 7 MPI processes each get a core, going up to 100% CPU.
- The 2 others seem to share a core, they don't go higher than 50% CPU.
- I suspend (qsig -s suspend) the long single-core job: the MPI processes then spread over the 7 cores, and each of the 7 processes gets 100% CPU.
- I resume the long single-core job (qsig -s resume): it lands on the one remaining core and rises again to 100% CPU.
- Stopping and restarting the 7 MPI processes => each of them gets 100% CPU.
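To confirm which processes actually share a core, rather than inferring it from CPU percentages, something like the following rough sketch could be run on the node. It is only an illustration: the function names and the PIDs in the example are made up, and last_cpu() assumes a Linux /proc filesystem where field 39 of /proc/<pid>/stat is the core the process last ran on.

```python
from collections import defaultdict


def parse_cpu_list(text):
    """Expand a cpuset-style list such as '0-2,4-7' into a set of core numbers."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cores.update(range(int(low), int(high) + 1))
        else:
            cores.add(int(part))
    return cores


def last_cpu(pid):
    """Return the core a process last ran on (Linux only, reads /proc/<pid>/stat)."""
    with open("/proc/%d/stat" % pid) as stat_file:
        data = stat_file.read()
    # The command name (field 2) may contain spaces, so split after the last ')'.
    fields = data.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field 39 (processor) is index 36.
    return int(fields[36])


def shared_cores(pid_to_core):
    """Map each core used by more than one process to the sorted PIDs on it."""
    by_core = defaultdict(list)
    for pid, core in pid_to_core.items():
        by_core[core].append(pid)
    return {core: sorted(pids) for core, pids in by_core.items() if len(pids) > 1}


if __name__ == "__main__":
    allowed = parse_cpu_list("0-2,4-7")
    # Hypothetical placement of 7 MPI ranks; on a real node this mapping
    # would be built with last_cpu(pid) for each rank's PID.
    placement = {1001: 0, 1002: 1, 1003: 2, 1004: 4, 1005: 4, 1006: 5, 1007: 6}
    print("cores in cpuset:", sorted(allowed))
    print("cores shared by several ranks:", shared_cores(placement))
    print("cpuset cores left idle:", sorted(allowed - set(placement.values())))
```

Sampling last_cpu() a few times in a row would be more telling than a single reading, since a process that is not pinned to one core may migrate between cores of its cpuset.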
I don't understand why I had to suspend and resume the single-core job before each of the 8 processes running on this node finally got 100% of CPU time.
Do you have any clue about this?