[torqueusers] Preventing compute node "starvation"
dbeer at adaptivecomputing.com
Wed Sep 18 10:06:54 MDT 2013
This might involve a bit of manual work, but you could do this with a root
cpuset. The idea, in a nutshell, is to reserve a core or two for the OS,
Nagios, the pbs_mom daemon, etc., and let jobs use the rest. On a 12-core
node, you would enable cpusets in TORQUE and declare only 11 cores for the
node in the nodes file (or 10 if you want to reserve 2 for these things).
Then you can either trust the OS to load-balance the other processes onto
the unused core, or manually place them in a cpuset that covers the
reserved core(s).
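
As a rough illustration of the core split described above, here is a minimal
Python sketch. It assumes jobs are given the lowest-numbered cores and the
highest-numbered ones are left for system processes; check which cores your
TORQUE build actually assigns. The helper names and the hostname "node01"
are hypothetical. Note that pin_to_system_cores() uses plain CPU affinity
(os.sched_setaffinity, Linux only) as a stand-in for a real cpuset, and it
only affects the given process, not children that are already running.

```python
import os


def split_cores(total_cores, reserved):
    """Return (job_cores, system_cores) as lists of core indices.

    Assumption: jobs get the lowest-numbered cores and the reserved
    cores are the highest-numbered ones.
    """
    if not 0 < reserved < total_cores:
        raise ValueError("reserved must be between 1 and total_cores - 1")
    job_cores = list(range(total_cores - reserved))
    system_cores = list(range(total_cores - reserved, total_cores))
    return job_cores, system_cores


def nodes_file_line(hostname, total_cores, reserved):
    """Build a nodes-file entry that advertises only the job cores."""
    return f"{hostname} np={total_cores - reserved}"


def pin_to_system_cores(pid, system_cores):
    """Restrict one running process to the reserved cores.

    Uses CPU affinity rather than a cpuset; Linux only, and changing
    another user's process requires sufficient privileges.
    """
    os.sched_setaffinity(pid, set(system_cores))


if __name__ == "__main__":
    job, system = split_cores(12, 1)
    print(nodes_file_line("node01", 12, 1))
    print("job cores:", job)
    print("system cores:", system)
```

You would call pin_to_system_cores() with the PID of pbs_mom, sshd, or the
Nagios agent if you did not want to rely on the OS scheduler alone.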
As for whether this is needed: pbs_mom should use minimal resources once a
job is actually active. It can use more on a larger node that is running
lots of small jobs, but even so it shouldn't be a huge amount. On a node
that is already filled with jobs, the only things pbs_mom should need to do
are send a status update to pbs_server (every 45 seconds by default; this
is configurable) and respond to pbs_server's poll requests (also every 45
seconds, and also configurable). There will be one poll request per job. I
don't know how much CPU Nagios uses, but typically people haven't had to
use this solution except on large-scale NUMA systems (usually > 1000 cores),
which have somewhat better support for doing it and are often running many
more jobs per node.
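
To put the 45-second intervals mentioned above in perspective, here is a
back-of-the-envelope Python sketch of the steady-state message rate pbs_mom
would handle on a full node. It assumes exactly one status update per
interval plus one poll per job per interval, and it ignores the cost of each
message. The function name is hypothetical.

```python
def mom_messages_per_second(jobs_on_node,
                            status_interval_s=45.0,
                            poll_interval_s=45.0):
    """Estimate pbs_mom's steady-state message rate on a busy node.

    Assumption: one status update per status interval, plus one poll
    request per job per poll interval (the defaults are 45 seconds).
    """
    status_rate = 1.0 / status_interval_s
    poll_rate = jobs_on_node / poll_interval_s
    return status_rate + poll_rate


if __name__ == "__main__":
    # A 12-core node running twelve single-core jobs.
    print(mom_messages_per_second(12))
```

With twelve jobs on the node this works out to 13 messages per 45 seconds,
or well under one message per second.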
On Wed, Sep 18, 2013 at 9:52 AM, Dave Ulrick <d-ulrick at comcast.net> wrote:
> We're running TORQUE 220.127.116.11 with Moab 6.1.5 on a RHEL 6.2 Linux HPC. It
> has 60 compute nodes. Each has 12 CPU cores and 2 NVidia GPUs. I use
> Nagios to monitor various services on the cluster including the TORQUE MOM.
> Occasionally, a user will run a job that taxes a node's resources so much
> that Nagios fails to get a timely response from the plugin that checks the
> health of the MOM so it flags the service as having a critical issue.
> Whenever this happens, I can usually count on the node having other issues
> such as a failure to accept SSH connections in a timely manner. I am
> concerned that we might see unpredictable problems as fallout if TORQUE
> and other crucial system processes are rendered unable to communicate with
> a node for a prolonged period of time.
> My theory is that processes that use a great deal of CPU and/or I/O might
> monopolize a node's resources so much that system processes such as the
> TORQUE MOM, sshd, Nagios NRPE, etc., are "starved" for resources. Any
> suggestions on how I might tune my compute nodes to prevent "starvation"
> without unduly impacting the performance of the computational jobs that
> are my HPC's reason for being?
> Dave Ulrick
> Email: d-ulrick at comcast.net
David Beer | Senior Software Engineer