Appendix F: Large Cluster Considerations
F.1 Communication Overview
TORQUE has enhanced much of the communication found in the original OpenPBS project. This has resulted in a number of key advantages:
In most cases, enhancements made apply to all systems and no tuning is required. However, some changes have been made configurable to allow site specific modification. The configurable communication parameters are:
F.2 Scalability Guidelines
In very large clusters (more than 1,000 nodes), it may also be advisable to tune a number of communication layer timeouts. By default, PBS MOM daemons time out on inter-MOM messages after 60 seconds. In TORQUE 1.1.0p5 and higher, this can be adjusted by setting the timeout parameter in the mom_priv/config file. If 15059 errors (cannot receive message from sisters) appear in the MOM logs, it may be necessary to increase this value.
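As an illustration of the adjustment described above, the following sketch sets the value in a MOM config file. It assumes the parameter is written as a `$timeout <seconds>` line, the usual form for MOM config parameters. The helper name set_mom_timeout is hypothetical, and the path and the value 120 in the usage comment are placeholders rather than recommendations.

```shell
#!/bin/sh
# Sketch only: set_mom_timeout is a hypothetical helper, not part of TORQUE.
# It replaces any existing $timeout line in a MOM config file, or appends one.
set_mom_timeout() {
    cfg=$1     # path to the config file (for example, <TORQUE home>/mom_priv/config)
    secs=$2    # new inter-MOM timeout in seconds
    tmp="$cfg.new.$$"

    # Copy every line except an existing $timeout setting.
    if [ -f "$cfg" ]; then
        grep -v '^\$timeout[[:space:]]' "$cfg" > "$tmp"
    else
        : > "$tmp"
    fi

    # Append the new setting and move the file into place.
    printf '$timeout %s\n' "$secs" >> "$tmp"
    mv "$tmp" "$cfg"
}

# Example (path and value are assumptions):
# set_mom_timeout /var/spool/torque/mom_priv/config 120
```

pbs_mom typically has to be restarted or told to re-read its configuration before a changed value takes effect.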
F.3 End User Command Caching
F.3.1 qstat
In a large system, users may place excessive load on the system through manual or automated use of resource manager end user client commands. A simple way of reducing this load is to use client command wrappers that cache data. The example script below caches the output of the command 'qstat -f' for 60 seconds and reports this information to end users.
#!/bin/sh

# USAGE: qstat $@

CMDPATH=/usr/local/bin/qstat
CACHETIME=60
TMPFILE=/tmp/qstat.f.tmp

if [ "$1" != "-f" ] ; then
  #echo "direct check (arg1=$1) "
  $CMDPATH $1 $2 $3 $4
  exit $?
fi

if [ -n "$2" ] ; then
  #echo "direct check (arg2=$2)"
  $CMDPATH $1 $2 $3 $4
  exit $?
fi

if [ -f $TMPFILE ] ; then
  TMPFILEMTIME=`stat -c %Z $TMPFILE`
else
  TMPFILEMTIME=0
fi

NOW=`date +%s`
AGE=$(($NOW - $TMPFILEMTIME))
#echo AGE=$AGE

for i in 1 2 3;do
  if [ "$AGE" -gt $CACHETIME ] ; then
    #echo "cache is stale "
    if [ -f $TMPFILE.1 ] ; then
      #echo someone else is updating cache
      sleep 5
      NOW=`date +%s`
      TMPFILEMTIME=`stat -c %Z $TMPFILE`
      AGE=$(($NOW - $TMPFILEMTIME))
    else
      break;
    fi
  fi
done

if [ -f $TMPFILE.1 ] ; then
  #echo someone else is hung
  rm $TMPFILE.1
fi

if [ "$AGE" -gt $CACHETIME ] ; then
  #echo updating cache
  $CMDPATH -f > $TMPFILE.1
  mv $TMPFILE.1 $TMPFILE
fi

#echo "using cache"
cat $TMPFILE

exit 0
The above script can easily be modified to cache any command and any combination of args by changing one or more of the following attributes:
For example, to cache the command pbsnodes -a, make the following changes:
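One possible way to generalize the wrapper is sketched here, on the assumption that the attributes in question are the command path, the cache time, and the cache file. The function cached_run is hypothetical; it takes those three items as arguments and, to stay short, leaves out the qstat script's wait-for-another-updater loop. It relies on GNU stat, as the qstat script does.

```shell
#!/bin/sh
# Sketch only: cached_run is a hypothetical helper, not part of TORQUE.
# Usage: cached_run CACHETIME TMPFILE COMMAND [ARGS...]
cached_run() {
    cachetime=$1    # maximum cache age in seconds
    tmpfile=$2      # file holding the cached output
    shift 2         # the remaining arguments are the command to cache

    # Treat a missing cache file as stale.
    age=$((cachetime + 1))
    if [ -f "$tmpfile" ]; then
        now=$(date +%s)
        mtime=$(stat -c %Y "$tmpfile")    # GNU stat: modification time
        age=$((now - mtime))
    fi

    if [ "$age" -gt "$cachetime" ]; then
        # Write to a private file, then rename it, so that readers
        # never see partially written output.
        "$@" > "$tmpfile.$$" && mv "$tmpfile.$$" "$tmpfile"
    fi

    cat "$tmpfile"
}

# Hypothetical use for pbsnodes -a (the path is an assumption):
# cached_run 60 /tmp/pbsnodes.a.tmp /usr/local/bin/pbsnodes -a
```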
F.5 Other Considerations
F.5.1 job_stat_rate
In a large system, there may be many users, many jobs, and many requests for information. To speed up response time for users and for programs using the API, the job_stat_rate parameter can be used to adjust when the pbs_server daemon queries MOMs for job information. Increasing this value keeps the server from constantly querying job information and causing other commands to block.
F.5.2 poll_jobs
The poll_jobs parameter allows a site to configure how the pbs_server daemon polls for job information. When set to TRUE, pbs_server polls job information in the background and does not block on user requests. When set to FALSE, pbs_server may block on user requests when its job information is stale. Large clusters should set this parameter to TRUE.
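The two server parameters above are set with qmgr. The sketch below shows one way this might look; tune_server_polling and the QMGR override are hypothetical conveniences, and the value 120 in the usage comment is a placeholder rather than a recommended setting.

```shell
#!/bin/sh
# Sketch only: tune_server_polling is a hypothetical helper.
# Set QMGR=echo to preview the commands without contacting a server.
tune_server_polling() {
    rate=$1    # job_stat_rate value; choose one appropriate to the site

    # Poll job information in the background instead of blocking user requests.
    ${QMGR:-qmgr} -c "set server poll_jobs = True"

    # Query MOMs for job information less often.
    ${QMGR:-qmgr} -c "set server job_stat_rate = $rate"
}

# Example (placeholder value), run as a TORQUE manager on the server host:
# tune_server_polling 120
```

The resulting settings can be reviewed afterwards with qmgr -c 'print server'.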
F.5.3 Internal Settings
On large, slow, and/or heavily loaded systems, it may be desirable to increase the pbs_tcp_timeout setting used by the pbs_mom daemon in mom-to-mom communication. This setting defaults to 20 seconds. For client-server communication, it can be set using the qmgr command. For mom-to-mom communication, changing it requires modifying the source code and rebuilding: edit the $TORQUEBUILDDIR/src/lib/Libifl/tcp_dis.c file and set pbs_tcp_timeout to the desired maximum number of seconds allowed for a mom-to-mom request to be serviced.
F.5.4 Scheduler Settings
If using Moab, a number of parameters can be set that may improve TORQUE performance. In an environment containing a large number of short-running jobs, the JOBAGGREGATIONTIME parameter (see Appendix F of the Moab Workload Manager Administrator's Guide) can be set to reduce the number of workload and resource queries performed by the scheduler when an event-based interface is enabled. If the pbs_server daemon is heavily loaded and PBS API timeout errors (e.g., 'Premature end of message') are reported within the scheduler, the TIMEOUT attribute of the RMCFG parameter (see Appendix F of the Moab Workload Manager Administrator's Guide) may be set to a value between 30 and 90 seconds.
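As a rough illustration, the helper below prints the two Moab settings in the form they might take in moab.cfg. The function moab_tuning_fragment is hypothetical, the resource manager name base and the default values are assumptions chosen for the example, and the Moab guide remains the authority on the exact syntax.

```shell
#!/bin/sh
# Sketch only: moab_tuning_fragment is a hypothetical helper that prints
# a moab.cfg fragment. All default values here are illustrative assumptions.
moab_tuning_fragment() {
    rm_name=${1:-base}        # resource manager name used in RMCFG[...]
    aggregation=${2:-00:00:04} # JOBAGGREGATIONTIME, in [[HH:]MM:]SS form
    rm_timeout=${3:-60}       # RMCFG TIMEOUT in seconds (30 to 90 per the text)

    printf 'JOBAGGREGATIONTIME %s\n' "$aggregation"
    printf 'RMCFG[%s] TIMEOUT=%s\n' "$rm_name" "$rm_timeout"
}

# Example: review the fragment before merging it into moab.cfg by hand.
# moab_tuning_fragment base 00:00:04 90
```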
F.5.5 File System
TORQUE can be configured to disable file system blocking until data is physically written to the disk by using the --disable-filesync argument with configure. While having filesync enabled is more reliable, it may lead to server delays for sites with either a large number of nodes or a large number of jobs. Filesync is enabled by default.
F.5.6 Network ARP cache
For networks with more than 512 nodes it is mandatory to increase the kernel's internal ARP cache size. For a network of ~1000 nodes, we use these values in /etc/sysctl.conf on all nodes and servers:
# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096

# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048

# Adjust where the gc will leave arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024

# Adjust the arp table gc to clean up more often
net.ipv4.neigh.default.gc_interval = 3600

# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600
Use sysctl -p to reload this file.
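To show how the three thresholds relate to the size of the neighbor table, the sketch below classifies an entry count against them. The helper arp_pressure and its labels are hypothetical. The 128/512/1024 figures used in the first example are the commonly shipped Linux defaults for gc_thresh1, gc_thresh2, and gc_thresh3, which is consistent with the 512-node figure above, but they should be checked on the system in question.

```shell
#!/bin/sh
# Sketch only: arp_pressure is a hypothetical helper.
# Usage: arp_pressure ENTRIES GC_THRESH1 GC_THRESH2 GC_THRESH3
arp_pressure() {
    entries=$1; t1=$2; t2=$3; t3=$4
    if [ "$entries" -ge "$t3" ]; then
        echo "full"            # hard limit reached: new entries are refused
    elif [ "$entries" -gt "$t2" ]; then
        echo "aggressive-gc"   # above the soft limit: gc cleans aggressively
    elif [ "$entries" -gt "$t1" ]; then
        echo "gc-eligible"     # gc may run, but is not yet aggressive
    else
        echo "ok"              # at or below gc_thresh1: gc leaves the table alone
    fi
}

# 600 entries against the common defaults, then against the values above:
# arp_pressure 600 128 512 1024     -> aggressive-gc
# arp_pressure 600 1024 2048 4096   -> ok
#
# Live check on Linux (assumes the iproute2 ip command is installed):
# arp_pressure "$(ip -4 neigh show | wc -l)" \
#     "$(sysctl -n net.ipv4.neigh.default.gc_thresh1)" \
#     "$(sysctl -n net.ipv4.neigh.default.gc_thresh2)" \
#     "$(sysctl -n net.ipv4.neigh.default.gc_thresh3)"
```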
The ARP cache size on other UNIXes can presumably be modified in a similar way.
An alternative approach is to keep a static /etc/ethers file with all hostnames and MAC addresses and load it with arp -f /etc/ethers. However, maintaining this file is quite cumbersome when nodes get new MAC addresses (due to repairs, for example).
© 2001-2010 Adaptive Computing Enterprises, Inc.