[torqueusers] cpusets

Martin Siegert siegert at sfu.ca
Wed Nov 30 12:14:23 MST 2011


Hi,

We only recently started using cpusets, and I do not have much experience
with them. However, I have now noticed several times that MPI jobs
(Open MPI with TM) slow down dramatically: apparently two processes
are sharing the same core (i.e., each gets only 50% CPU usage), even though
the number of cores in the cpuset equals the number of processes
of the MPI job on that particular node.

E.g.,

top - 11:05:24 up 42 days, 22:43,  2 users,  load average: 6.99, 6.93, 6.68
Tasks: 468 total,   8 running, 460 sleeping,   0 stopped,   0 zombie
Cpu(s): 24.9%us,  0.2%sy,  0.0%ni, 74.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24675188k total, 12099684k used, 12575504k free,    69968k buffers
Swap: 16777208k total,    29932k used, 16747276k free,  9946292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3717 user1     25   0  183m  91m  14m R 100.0  0.4  15:43.62 Clark
 4526 user2     25   0  109m  36m 3088 R 100.0  0.2   2:02.43 mdrun
15863 user3     25   0  459m 163m  15m R 100.0  0.7 711:26.30 wrfm_arw.exe
15864 user3     25   0  452m 156m  15m R 100.0  0.6 688:28.80 wrfm_arw.exe
 4562 user2     25   0  109m  36m 3088 R 99.7  0.2   0:23.02 mdrun
15861 user3     25   0  462m 165m  15m R 50.2  0.7 510:02.12 wrfm_arw.exe
15862 user3     25   0  465m 169m  15m R 49.9  0.7 446:21.37 wrfm_arw.exe

root@b311:~> cat /proc/15861/cpuset
/torque/4913985.b0
root@b311:~> cat /proc/15862/cpuset
/torque/4913985.b0

(the same for 15863 and 15864), and

root@b311:~> ls /dev/cpuset//torque/4913985.b0
68  cpu_exclusive   memory_pressure     notify_on_release
69  cpus            memory_spread_page  sched_relax_domain_level
70  mem_exclusive   memory_spread_slab  tasks
71  memory_migrate  mems
root@b311:~> cat /dev/cpuset/torque/4913985.b0/cpus
0-1,4,8
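
To make the "number of cores equals number of processes" check explicit,
here is a minimal sketch that expands a cpus list such as the one above into
individual CPU numbers so they can be counted. The function name
expand_cpulist is only an illustration, not part of TORQUE or the cpuset
tools, and it assumes the seq utility from coreutils is available.

```shell
# Expand a cpuset-style CPU list (e.g. "0-1,4,8") into one CPU number per line.
# expand_cpulist is a hypothetical helper name used for illustration only.
# The function body runs in a subshell, so the IFS change does not leak out.
expand_cpulist() (
    IFS=,
    for part in $1; do
        case $part in
            # A range "a-b": print every CPU from a to b inclusive.
            *-*) seq "${part%-*}" "${part#*-}" ;;
            # A single CPU number.
            *)   echo "$part" ;;
        esac
    done
)
```

For the list shown above, expand_cpulist "0-1,4,8" | wc -l reports 4, which
matches the four wrfm_arw.exe processes on this node.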

Do processes within a cpuset get bound to a particular CPU?
If yes, how do I find out which one?
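
For what it is worth, a cpuset on its own only restricts tasks to the listed
CPUs; pinning to a single core would have to come from a separate affinity
setting. One way to inspect this per process is sketched below, assuming a
Linux /proc whose status file contains a Cpus_allowed_list line (older
kernels may lack it, in which case taskset -p PID reports the mask). The
function name show_cpu_binding and its optional second argument are
hypothetical conveniences, not an existing tool.

```shell
# Print the allowed CPU list and the CPU a process last ran on.
# Usage: show_cpu_binding PID [PROC_ROOT]   (PROC_ROOT defaults to /proc)
# show_cpu_binding is a hypothetical helper name used for illustration only.
show_cpu_binding() {
    pid=$1
    proc=${2:-/proc}
    # Cpus_allowed_list is the affinity mask in list form, e.g. "0-1,4,8".
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' "$proc/$pid/status")
    # Field 39 of /proc/PID/stat is the CPU the task last executed on.
    # The command name (field 2) may contain spaces, so strip everything
    # up to the last ") " first; field 39 then becomes awk field 37.
    last_cpu=$(sed 's/^.*) //' "$proc/$pid/stat" | awk '{print $37}')
    echo "pid=$pid allowed=$allowed last_cpu=$last_cpu"
}
```

Running it in a loop, e.g. for p in 15861 15862 15863 15864; do
show_cpu_binding $p; done, would show whether two of the processes have an
allowed list narrowed to the same single CPU, or merely keep landing on the
same core.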

Anyway, if you have an idea what could be causing this and how to
solve it, please let me know.

Thanks!

Cheers,
Martin

-- 
Martin Siegert
Simon Fraser University

