[torqueusers] Preventing compute node "starvation"

Dave Ulrick d-ulrick at comcast.net
Wed Sep 18 09:52:36 MDT 2013


Hi,

We're running TORQUE 4.2.3.1 with Moab 6.1.5 on a RHEL 6.2 Linux HPC 
cluster. It has 60 compute nodes, each with 12 CPU cores and 2 NVIDIA 
GPUs. I use Nagios to monitor various services on the cluster, including 
the TORQUE MOM daemons.

Occasionally, a user will run a job that taxes a node's resources so much 
that Nagios fails to get a timely response from the plugin that checks the 
health of the MOM, so Nagios flags the service as critical. 
Whenever this happens, I can usually count on the node having other issues 
such as a failure to accept SSH connections in a timely manner. I am 
concerned that we might see unpredictable problems as fallout if TORQUE 
and other crucial system processes are rendered unable to communicate with 
a node for a prolonged period of time.
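
One way to put a number on the "timely manner" symptom above is to time how 
long a daemon takes to send its first bytes. The following is only a 
minimal sketch under stated assumptions, not the Nagios plugin in use here: 
the function name time_banner is hypothetical, and it assumes a service 
such as sshd that sends a greeting banner as soon as a client connects. 
Waiting for the banner, rather than just for the TCP connect, is 
deliberate: the kernel can complete a connect even when the user-space 
daemon is not getting scheduled.

```python
import socket
import time


def time_banner(host, port, timeout=5.0):
    """Connect and wait for the service's first bytes.

    Returns (banner_bytes_or_None, elapsed_seconds). None means the
    service did not send anything within the timeout, refused the
    connection, or was unreachable.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The timeout set by create_connection also applies to recv().
            banner = sock.recv(256) or None
    except OSError:
        # Covers timeouts, refused connections, and resolution failures.
        banner = None
    return banner, time.monotonic() - start
```

For example, time_banner("node01", 22) would time sshd on a node 
hypothetically named node01 (port 22 being the usual sshd default). A 
steadily growing elapsed time while a heavy job runs would be consistent 
with the daemon being starved, though it would not by itself show whether 
CPU or I/O is the bottleneck.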

My theory is that processes that use a great deal of CPU and/or I/O might 
monopolize a node's resources so much that system processes such as the 
TORQUE MOM, sshd, Nagios NRPE, etc., are "starved" for resources. Any 
suggestions on how I might tune my compute nodes to prevent "starvation" 
without unduly impacting the performance of the computational jobs that 
are my HPC's reason for being?

Thanks,
Dave
-- 
Dave Ulrick
Email: d-ulrick at comcast.net

