[torquedev] Cpuset behavior in TORQUE 3.0

Troy Baer tbaer at utk.edu
Fri Jul 2 13:24:53 MDT 2010

On Fri, 2010-07-02 at 11:16 -0600, David Beer wrote:
> The current behavior for cpusets (and this has been the case from the
> beginning) is that when they fail, they fail silently. There is no
> checking, and the only form of notification is found in the log file.
> Nothing is done to verify that the correct number of cpus are
> configured, etc. 
> The current behavior is documented, but not necessarily the most
> desirable. There are possibilities like making sure that the actual
> number of cpus is the number that the user has configured when the
> server is compiled with cpusets enabled, or making sure /dev/cpuset is
> actually mounted for the moms that have cpusets enabled, etc. It just
> seems that if someone is compiling TORQUE specifically to use cpusets,
> TORQUE shouldn't let them think that cpusets are working when they
> aren't.

I'm all for more sanity checking, but please be careful that TORQUE does
not try to second-guess the administrator.  For instance, I'm in the
process of installing an SGI UV, and for various reasons the processors
in that system have hyperthreading enabled, so the OS reports twice as
many cores as are physically present.  If I tell TORQUE that that box
has 1024 cores, I don't want its pbs_mom configuring the torque cpuset
for 2048 cores just because that's what the OS reports.
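
To make that concrete, here is a minimal sketch of the kind of check I
have in mind. It is not TORQUE code; the function name and messages are
made up for illustration. The idea is that a configured count lower than
what the OS reports is treated as a deliberate admin choice and only
noted, while a configured count higher than what the OS reports is
flagged as an error.

```python
import os


def check_configured_cpus(configured, reported=None):
    """Compare the admin-configured cpu count with the OS-reported one.

    Hypothetical helper: returns (ok, message) and never overrides the
    configured value.
    """
    if reported is None:
        # os.cpu_count() counts logical cpus, so hyperthreads are included.
        reported = os.cpu_count() or 0
    if configured > reported:
        # More cpus configured than exist: a real misconfiguration.
        return False, "configured %d cpus but OS reports only %d" % (
            configured, reported)
    if configured < reported:
        # Fewer configured than reported: keep the admin's number.
        return True, "using %d of %d cpus reported by OS" % (
            configured, reported)
    return True, "configured cpu count matches OS"
```

With the numbers above, check_configured_cpus(1024, 2048) passes and
keeps 1024, whereas check_configured_cpus(2048, 1024) fails.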

It would be ideal if TORQUE continued to handle its cpusets in a way
that's compatible with SGI's boot/user cpuset functionality [1].  I
realize that most people probably won't use cpusets that way, but the
current TORQUE cpusets implementation is easily compatible with SGI's
boot/user cpusets and I'd hate for that to change drastically.

One other thing I've noticed with cpusets in TORQUE 2.4.5 on the
small-ish UV system to which I currently have access is that the cleanup
of a job's cpuset after the job ends is *extremely* slow for jobs using
large numbers of cores (>100) on a node.  It looks like pbs_mom creates
a cpuset for every core allocated to the job underneath the job's
cpuset, and removing those takes significant time (several minutes in
some cases) because they're removed sequentially, one at a time.  Worse,
pbs_mom does not respond to network communications while this cleanup is
happening, so the node appears to be down from the POV of pbs_server and
the scheduler while the cpuset cleanup occurs.
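
As a rough illustration of one way to keep the daemon responsive, the
sketch below removes a job's cpuset directory tree bottom-up in a worker
thread instead of in the main loop. This is an assumption about a
possible approach, not a description of what pbs_mom does; the function
names are hypothetical, and it would not make the removal itself any
faster. Cpusets are removed with rmdir, children before parents.

```python
import os
import threading


def remove_cpuset_tree(path):
    """Remove a cpuset directory and all child cpusets beneath it.

    Only rmdir is used: on a cpuset filesystem the files inside each
    directory are kernel-provided and go away with the directory.
    """
    # topdown=False yields children before their parent directory.
    for dirpath, _dirnames, _filenames in os.walk(path, topdown=False):
        os.rmdir(dirpath)


def remove_in_background(path):
    """Start the removal in a worker thread and return the thread.

    The caller can keep servicing requests and join() the thread later.
    """
    worker = threading.Thread(
        target=remove_cpuset_tree, args=(path,), daemon=True)
    worker.start()
    return worker
```

A caller would pass the job's cpuset directory (somewhere under
/dev/cpuset; the exact layout is not assumed here) and check on the
returned thread from its main loop.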

[1] http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/LX_Resource_AG/sgi_html/ch04.html#Z1104954695tls

Troy Baer, HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
Phone:  865-241-4233