[torqueusers] Preventing compute node "starvation"
d-ulrick at comcast.net
Wed Sep 18 09:52:36 MDT 2013
We're running TORQUE 18.104.22.168 with Moab 6.1.5 on a RHEL 6.2 Linux HPC
cluster. It has 60 compute nodes, each with 12 CPU cores and 2 NVIDIA GPUs.
I use Nagios to monitor various services on the cluster, including the
TORQUE MOM.

Occasionally, a user will run a job that taxes a node's resources so
heavily that Nagios fails to get a timely response from the plugin that
checks the health of the MOM, and Nagios flags the service as critical.
Whenever this happens, I can usually count on the node having other
problems as well, such as being slow to accept SSH connections. I am
concerned that we might see unpredictable problems as fallout if TORQUE
and other crucial system processes are unable to communicate with a node
for a prolonged period of time.

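As a rough illustration of the timeout behavior described above, here is a minimal sketch of a responsiveness check. It is not the actual Nagios plugin in use on this cluster. The function name `check_tcp_response`, the thresholds, and the use of a plain TCP connect as the health test are all assumptions made for the example. It follows the standard Nagios plugin convention of returning 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
import socket
import time

# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_tcp_response(host, port, warn_s=1.0, crit_s=5.0):
    """Time a TCP connect and map the result to a Nagios-style status.

    Hypothetical helper: warn_s and crit_s are illustrative thresholds,
    not values taken from any real plugin configuration.
    """
    start = time.monotonic()
    try:
        # crit_s doubles as the hard timeout for the connection attempt.
        with socket.create_connection((host, port), timeout=crit_s):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Covers timeouts, refused connections, and unreachable hosts.
        return CRITICAL, f"CRITICAL - no response from {host}:{port} ({exc})"

    if elapsed >= warn_s:
        return WARNING, f"WARNING - {host}:{port} answered in {elapsed:.2f}s"
    return OK, f"OK - {host}:{port} answered in {elapsed:.2f}s"
```

A wrapper script would print the message and exit with the returned code, for example after calling `check_tcp_response("node01", 15002)`. Both the host name `node01` and port 15002 (assumed here to be the port the MOM listens on) are placeholders. On a starved node the connect attempt would be expected to exceed `crit_s`, which is the condition that produces the critical alert.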
My theory is that processes that use a great deal of CPU and/or I/O might
monopolize a node's resources so much that system processes such as the
TORQUE MOM, sshd, Nagios NRPE, etc., are "starved" for resources. Any
suggestions on how I might tune my compute nodes to prevent "starvation"
without unduly impacting the performance of the computational jobs that
are my HPC's reason for being?