[torquedev] cpuset memory enforcement mod
Chad Vizino
vizino at psc.edu
Thu Oct 30 21:05:57 MDT 2008
Greetings,
We've been running a customized Torque 2.3.0 on our two SGI Altix 4700
systems (144 and 768 cores) since April, after making a few changes to
repair a couple of problems with cpuset cleanup (see my post to this
list from around then). We have also gone on to significantly alter
pbs_mom to work in concert with a custom scheduler that attempts to pick
contiguous CPUs for cpusets, but that's another matter... :-)
For now, we assign full blades (each blade on our systems contains two
dual-core sockets sharing 8 GB of memory) to jobs. Since we set
mem_exclusive (and of course cpu_exclusive) on our cpusets, we struggled
with how best to enforce memory limits per job--we wanted a job to be
able to use all of the memory on the blades it has been assigned but no
more (no paging!). After looking at cpuset memory_pressure and wanting
to take fairly aggressive action against paging, we cautiously decided
to alter pbs_mom to poll memory_pressure for each cpuset and kill a job,
with a descriptive message to stderr, if it exceeds its assigned memory.
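To give a rough idea of the polling logic, here is a minimal sketch in Python. It is not the actual pbs_mom change (pbs_mom is written in C), and several details are assumptions made for illustration: the /dev/cpuset/torque mount point with one directory per job, the threshold of 0, and the helper names read_memory_pressure, cpuset_pids, over_limit and poll_once. Writing the message to the job's stderr is left out.

```python
import os
import signal
from pathlib import Path

# Assumed layout (not stated in this post): one cpuset directory per job
# under a cpuset filesystem mount.
CPUSET_ROOT = Path("/dev/cpuset/torque")


def read_memory_pressure(cpuset_dir):
    """Return the integer in <cpuset_dir>/memory_pressure, or None if unreadable."""
    try:
        return int((Path(cpuset_dir) / "memory_pressure").read_text().strip())
    except (OSError, ValueError):
        return None


def cpuset_pids(cpuset_dir):
    """Return the PIDs listed in the cpuset's tasks file."""
    try:
        text = (Path(cpuset_dir) / "tasks").read_text()
    except OSError:
        return []
    return [int(tok) for tok in text.split() if tok.isdigit()]


def over_limit(pressure, threshold=0):
    """True if a readable pressure value is above the (assumed) threshold."""
    return pressure is not None and pressure > threshold


def poll_once(root=CPUSET_ROOT, threshold=0, kill=os.kill):
    """Check every job cpuset under root once and signal tasks of offenders.

    Returns the names of the cpusets that were acted on. The kill callable
    is injectable so the logic can be exercised without signalling anything.
    """
    acted_on = []
    root = Path(root)
    if not root.is_dir():
        return acted_on
    for job_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if over_limit(read_memory_pressure(job_dir), threshold):
            for pid in cpuset_pids(job_dir):
                try:
                    kill(pid, signal.SIGKILL)
                except ProcessLookupError:
                    pass  # task exited between reading tasks and signalling
            acted_on.append(job_dir.name)
    return acted_on
```

Two caveats on the sketch: the kernel only computes memory_pressure when memory_pressure_enabled is set to 1 in the root cpuset, and on a cgroup-style mount the files carry a "cpuset." prefix (cpuset.memory_pressure), so the file names above would need adjusting there.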
So far, this has been working quite well, and I thought others in the
group might be interested in seeing this change integrated into Torque.
I'm not sure it is right for general integration, but it could perhaps
be an option for those who want to use it. Let me know what you think.
Hope to meet some of you at SC08.
Regards,
-Chad
Chad Vizino
Pittsburgh Supercomputing Center