Copyright © 2007, 2008 Cluster Resources, Inc.
Moab and Torque can be used to manage the batch system for a Cray XT4 supercomputer. This document describes how Moab can be configured to use Torque and the native resource manager interface to bring Moab's unmatched scheduling capabilities to the Cray XT4.
Using xtopview, unpack the Torque tarball into the software directory in the shared root.
While still in xtopview, run configure with the options set appropriately for your installation. Run ./configure --help to see a list of configure options. CRI recommends installing the torque binaries into /opt/torque/$version and establishing a symbolic link to it from /opt/torque/default. At a minimum, you will need to specify the hostname where the torque server will run (--with-default-server) if it is different from the host it is being compiled on. The torque server host will normally be the sdb node for XT4 installations.
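As a rough sketch of this step, the helper below only prints a configure command line so that it can be reviewed before being run inside xtopview. The version 2.2.0, the server host nid00003, and the server home /var/spool/torque are taken from the examples in this document. The function name is hypothetical, and the flags should be checked against the output of ./configure --help for your Torque release.

```shell
# Hypothetical helper: print (do not run) a Torque configure command line.
# $1 = Torque version, $2 = host where pbs_server will run (normally the sdb node)
torque_configure_cmd() {
    version=$1
    server=$2
    printf './configure --prefix=/opt/torque/%s --with-server-home=/var/spool/torque --with-default-server=%s\n' \
        "$version" "$server"
}

# Example values assumed from this document: Torque 2.2.0, server on nid00003
torque_configure_cmd 2.2.0 nid00003
```

Printing the command rather than executing it keeps the sketch safe to try outside xtopview; paste the reviewed line into the xtopview shell when you are satisfied with it.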
While still in xtopview, compile and install torque into the shared root. You may also need to link /opt/torque/default to this installation. Exit xtopview.
In this example we assume the torque server will be running on the sdb node. If you are installing torque with its server home in /var as in this example, and your var filesystem is being served from your boot node under /snv, you will need to log in to sdb and determine the nid with 'cat /proc/cray_xt/nid'.
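The nid read from /proc is a plain number, while the examples in this document refer to nodes by names such as nid00003 and nid00064. A minimal sketch of that mapping follows. The function name is hypothetical and the five-digit zero padding is inferred from the hostnames used in this document.

```shell
# Hypothetical helper: turn a numeric nid into the nidNNNNN hostname form
# used in this document (for example 3 -> nid00003).
nid_to_hostname() {
    printf 'nid%05d\n' "$1"
}

# On a Cray service node the nid is read from /proc; elsewhere this is skipped.
if [ -r /proc/cray_xt/nid ]; then
    nid_to_hostname "$(cat /proc/cray_xt/nid)"
fi
```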
Stage out the mom dirs and client server info on all login nodes. This example assumes you are using a persistent /var filesystem mounted from /snv on the boot node. Alternatively, a RAM var filesystem must be populated by a skeleton tarball on the boot node (/rr/current/.shared/var-skel.tgz) into which these files must be added. The example below assumes that you have 3 login nodes with nids of 4, 64 and 68. Place the hostname of the sdb node in the server_name file.
Example 6. Copy out mom dirs and client server info
# cd /rr/current/software/torque-2.2.0/tpackages/mom/var/spool
# for i in 4 64 68
> do cp -pr torque /snv/$i/var/spool
> echo nid00003 > /snv/$i/var/spool/torque/server_name
> # Uncomment the following if userids are not resolvable from the pbs_server host
> # echo "QSUBSENDUID true" > /snv/$i/var/spool/torque/torque.cfg
> done
Configure the torque server by informing it of its hostname and running the torque.setup script.
Add access and submit permission from your login nodes. You will need to enable host access by setting acl_host_enable to true and adding the nid hostnames of your login nodes to acl_hosts. In order to be able to submit from these same login nodes, you need to add them as submit_hosts and this time use their hostnames as returned from the hostname command.
Example 8. Customize server settings
Enable scheduling to allow Torque events to be sent to Moab. Note: If this is not set, Moab will automatically set it on startup.
# qmgr -c "set server scheduling = true"
Keep information about completed jobs around for a time so that Moab can detect and record their completion status. Note: If this is not set, Moab will automatically set it on startup.
# qmgr -c "set server keep_completed = 300"
Set the default node count for a job to be 1.
# qmgr -c "set server resources_default.nodes = 1"
Set resources_available.nodes equal to the maximum number of procs that can be requested in a job.
# qmgr -c "set server resources_available.nodes = 12500"
Do this for each queue individually as well.
# qmgr -c "set queue batch resources_available.nodes = 12500"
Only allow jobs submitted from hosts specified by the acl_hosts parameter.
# qmgr -c "set server acl_host_enable = true"
# qmgr -c "set server acl_hosts += nid00004"
# qmgr -c "set server acl_hosts += nid00064"
# qmgr -c "set server acl_hosts += nid00068"
# qmgr -c "set server submit_hosts += login1"
# qmgr -c "set server submit_hosts += login2"
# qmgr -c "set server submit_hosts += login3"
# #qmgr -c "set server disable_server_id_check = true"
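With more than a few login nodes, the access and submit settings above can be generated rather than typed. The sketch below prints the qmgr commands for a list of nid:hostname pairs. The function name is hypothetical, and the pairing of nids 4, 64 and 68 with login1, login2 and login3 is an assumption based on the order used in this document.

```shell
# Hypothetical helper: print the qmgr commands that grant host access and
# submit permission. Arguments are nid:hostname pairs, one per login node.
print_login_acl_cmds() {
    echo 'qmgr -c "set server acl_host_enable = true"'
    for pair in "$@"; do
        nid=${pair%%:*}      # numeric nid, e.g. 4
        host=${pair#*:}      # name returned by the hostname command, e.g. login1
        printf 'qmgr -c "set server acl_hosts += nid%05d"\n' "$nid"
        printf 'qmgr -c "set server submit_hosts += %s"\n' "$host"
    done
}

# Login nodes assumed from this document: nids 4, 64 and 68
print_login_acl_cmds 4:login1 64:login2 68:login3
```

The output can be reviewed and then piped to sh on the torque server host.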
Define your login nodes to torque. You should set np to the number of cores on your system.
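Torque reads node definitions from a nodes file, normally server_priv/nodes under the server home, with one "hostname np=N" entry per line. The sketch below prints such entries. The function name, the hostnames, and the np value of 12500 (borrowed from the resources_available.nodes setting in Example 8) are illustrative assumptions, not values from your system.

```shell
# Hypothetical helper: print one nodes-file entry per login node.
# $1 = np value, remaining arguments = login node hostnames
print_nodes_entries() {
    np=$1
    shift
    for host in "$@"; do
        printf '%s np=%s\n' "$host" "$np"
    done
}

# Assumed values: three login nodes and np=12500
print_nodes_entries 12500 login1 login2 login3
```

Redirect the output into the nodes file on the torque server host once the entries look right.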
Torque provides an init.d script for starting pbs_server as a service.
Example 10. Copy in init.d script
# cd /rr/current/software/torque-2.2.0
# cp contrib/init.d/pbs_server /etc/init.d
# chmod +x /etc/init.d/pbs_server
Edit the init.d file as necessary -- i.e. change PBS_DAEMON and PBS_HOME as appropriate.
# vi /etc/init.d/pbs_server
PBS_DAEMON=/opt/torque/default/sbin/pbs_server
PBS_HOME=/var/spool/torque
Torque provides an init.d script for starting pbs_mom as a service.
Example 11. Copy in init.d script
# cd /rr/current/software/torque-2.2.0
Edit the init.d file as necessary -- i.e. change PBS_DAEMON and PBS_HOME as appropriate.
# vi contrib/init.d/pbs_mom
PBS_DAEMON=/opt/torque/default/sbin/pbs_mom
PBS_HOME=/var/spool/torque
# pdcp -w login1,login2,login3 contrib/init.d/pbs_mom /etc/init.d
# pdsh -w login1,login2,login3 chmod +x /etc/init.d/pbs_mom
Moab provides module files that can be used to establish the proper Torque environment. You may wish to copy these out onto the login nodes as well.
If Torque is not already installed on your system, follow the Torque-XT4 Installation Notes to install Torque on the sdb node.
Download the latest Moab release from Cluster Resources, Inc.
Note: The correct tarball type can be recognized by the xt4 tag in its name.
Using xtopview, unpack the Moab tarball into the software directory in the shared root.
While still in xtopview, run configure with the options set appropriately for your installation. Run ./configure --help to see a list of configure options. CRI recommends installing the moab binaries into /opt/moab/$version and establishing a symbolic link to it from /opt/moab/default. Since the moab home directory must be read-write by root, CRI recommends you specify the homedir in a location such as /var/spool/moab.
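A minimal sketch of the corresponding Moab configure line is shown below. It only prints the command for review. The function name is hypothetical, the version 5.2.0 is a placeholder rather than a recommendation, and the --with-homedir flag is an assumption that should be confirmed against ./configure --help for your Moab release.

```shell
# Hypothetical helper: print (do not run) a Moab configure command line.
# $1 = Moab version, $2 = Moab home directory (defaults to /var/spool/moab)
moab_configure_cmd() {
    version=$1
    homedir=${2:-/var/spool/moab}
    printf './configure --prefix=/opt/moab/%s --with-homedir=%s\n' "$version" "$homedir"
}

# Placeholder version; substitute the version of the tarball you downloaded
moab_configure_cmd 5.2.0
```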
While still in xtopview, install moab into the shared root. You may also need to link /opt/moab/default to this installation.
Moab provides a module file that can be used to establish the proper Moab environment. You may also want to install this module file onto the login nodes.
Moab's native resource manager interface scripts require a Perl XML module to communicate via the BASIL interface, so the Perl XML::LibXML module should be installed. The default method is to use the perldeps make target to install a bundled version of the module into a local Moab lib directory. The module may also be downloaded and installed from CPAN. Exit xtopview.
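One way to check whether the module is already visible to perl is sketched below. The function name is hypothetical. Note that the check only reflects what the perl on your PATH can load by default; a copy installed by the perldeps target into Moab's own lib directory may not be found this way.

```shell
# Hypothetical helper: succeed only if perl can load the named module.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if has_perl_module XML::LibXML; then
    echo "XML::LibXML is available"
else
    echo "XML::LibXML not found; install it (for example with the perldeps make target)"
fi
```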
In this example we assume the moab server will be running on the sdb node. If you are installing moab with its server home in /var as in this example, and your var filesystem is being served from your boot node under /snv, you will need to log in to sdb and determine the nid with 'cat /proc/cray_xt/nid'.
The moab.cfg file should be customized for your scheduling environment. See the Moab Admin Guide for more details.
The only essential parameter is the SCHEDCFG line so the clients can find the server. This example assumes you are using a persistent /var filesystem mounted from /snv on the boot node and that your login nodes have nids of 4, 64 and 68. Alternatively, a RAM var filesystem must be populated by a skeleton tarball on the boot node (/rr/current/.shared/var-skel.tgz) into which these files must be added.
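For illustration, the sketch below prints a minimal SCHEDCFG line of the form the clients need. The function name is hypothetical, the server host nid00003 is borrowed from the Torque server_name example in this document, and the port 42559 is assumed to be the Moab default; adjust both to match your installation.

```shell
# Hypothetical helper: print a minimal SCHEDCFG line for moab.cfg.
# $1 = Moab server host, $2 = port (42559 assumed as the default here)
schedcfg_line() {
    printf 'SCHEDCFG[moab] SERVER=%s:%s\n' "$1" "${2:-42559}"
}

# Assumed server host: the sdb node from the earlier Torque example
schedcfg_line nid00003
```

The resulting line belongs in the moab.cfg file under the Moab home directory on each login node, for example /snv/4/var/spool/moab/moab.cfg under the persistent /var layout assumed above.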
The resource manager native interface tools are located in the $prefix/tools directory by default and consist of a configuration file (config.xt4.pl) and various scripts (job.query.xt4.pl, node.query.xt4.pl, job.start.xt4.pl, job.cancel.xt4.pl, ...). Edit the configuration file to apply to your system environment.
Example 25. Edit the XT4 configuration file
# cd /rr/current/opt/moab/default/tools
# vi config.xt4.pl
$ENV{PATH} = "/opt/torque/default/bin:/usr/bin:$ENV{PATH}";
$loginPattern = "^login"; # These are the login nodes used by interactive jobs
$yodPattern = "^login"; # These are the nodes running pbs_mom
Moab provides an init.d script for starting Moab as a service. Using xtopview into the sdb node, copy the init script into /etc/init.d.
The MOABHOMEDIR environment variable must be set in your environment when starting moab or using moab commands. You will also want to adjust your path to include the moab and torque bin and sbin directories. The proper environment can be established by loading the appropriate moab module, by sourcing properly edited login files, or by directly modifying your environment variables.
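If you choose to modify your environment variables directly, a minimal sketch for a Bourne-style shell looks like the following. The paths are the ones recommended earlier in this document and are assumptions about your layout; the MOAB_PREFIX and TORQUE_PREFIX variable names are introduced here only for readability.

```shell
# Assumed locations, taken from the examples in this document
export MOABHOMEDIR=/var/spool/moab
MOAB_PREFIX=/opt/moab/default
TORQUE_PREFIX=/opt/torque/default

# Put the Moab and Torque bin and sbin directories ahead of the existing PATH
export PATH="$MOAB_PREFIX/bin:$MOAB_PREFIX/sbin:$TORQUE_PREFIX/bin:$TORQUE_PREFIX/sbin:$PATH"
```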
It is preferable to have no running jobs during the upgrade. This can be done by closing all queues in Torque or setting a system reservation in Moab and waiting for all jobs to complete. Often, it is possible to upgrade Torque with running jobs in the system, but you may risk problems associated with Torque being down when the jobs complete and incompatibilities between the new and old file formats and job states.
Using xtopview, unpack the Torque tarball into the software directory in the shared root.
While still in xtopview, run configure with the options set appropriately for your installation. Run ./configure --help to see a list of configure options. CRI recommends installing the torque binaries into /opt/torque/$version and establishing a symbolic link to it from /opt/torque/default. At a minimum, you will need to specify the hostname where the torque server will run (--with-default-server) if it is different from the host it is being compiled on. The torque server host will normally be the sdb node for XT4 installations.
While still in xtopview, compile and install torque into the shared root. You may also need to link /opt/torque/default to this installation. Exit xtopview.
Example 35. Make and Make Install
default/:/software/torque-2.2.0 # make
default/:/software/torque-2.2.0 # make packages
default/:/software/torque-2.2.0 # make install
default/:/software/torque-2.2.0 # rm /opt/torque/default
default/:/software/torque-2.2.0 # ln -sf /opt/torque/2.2.0/ /opt/torque/default
default/:/software/torque-2.2.0 # exit
Note: If you still have running jobs, you will want to start pbs_mom with the -p flag to preserve them. By default, the init.d startup script will not preserve running jobs unless it is altered to start pbs_mom with the -p flag.
On the boot node as root:
It is preferable to have no running jobs during the upgrade. This can be done by setting a system reservation in Moab and waiting for all jobs to complete. Often, it is possible to upgrade Moab with running jobs in the system, but you may risk problems associated with Moab being down when the jobs complete.
Download the latest Moab release from Cluster Resources, Inc.
Note: The correct tarball type can be recognized by the xt4 tag in its name.
Using xtopview, unpack the Moab tarball into the software directory in the shared root.
While still in xtopview, run configure with the options set appropriately for your installation. Run ./configure --help to see a list of configure options. CRI recommends installing the moab binaries into /opt/moab/$version and establishing a symbolic link to it from /opt/moab/default. Since the moab home directory must be read-write by root, CRI recommends you specify the homedir in a location such as /var/spool/moab.
While still in xtopview, install moab into the shared root. You may also need to link /opt/moab/default to this installation.
If you have previously installed the perl modules in the perl site directories (configure --with-perl-libs=site), you should not need to remake the perl modules. However, the default is to install the perl modules local to the moab install directory and since it is normal practice to configure the moab upgrade to use a new install directory (configure --prefix), it will generally be necessary to reinstall the perl modules. Exit xtopview when done with this step.
If the upgrade brings in changes to the config.xt4.pl file, you will need to edit your copy and manually merge in the changes from the new config.xt4.pl.dist file. This is rare, but it does happen on occasion. One way to discover whether changes have been introduced is to diff the config.xt4.pl.dist files from the old and new tools directories. A missed change generally becomes apparent quite quickly, because the xt4 scripts will usually fail if the config file has not been updated.
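The comparison of the two .dist files can be sketched as follows. The function name is hypothetical, and the tools directories you pass in depend on the install prefixes you chose for the old and new Moab versions.

```shell
# Hypothetical helper: report whether config.xt4.pl.dist differs between
# the old and new Moab tools directories.
# $1 = old tools directory, $2 = new tools directory
# Returns 0 if the files differ, 1 if they are identical, 2 if one is missing.
xt4_config_dist_changed() {
    old=$1/config.xt4.pl.dist
    new=$2/config.xt4.pl.dist
    [ -r "$old" ] && [ -r "$new" ] || { echo "missing .dist file" >&2; return 2; }
    if cmp -s "$old" "$new"; then
        return 1                      # identical: nothing to merge
    fi
    diff -u "$old" "$new" || true     # show what must be merged by hand
    return 0
}

# Example call (paths are placeholders for your old and new install prefixes):
# xt4_config_dist_changed /opt/moab/OLD_VERSION/tools /opt/moab/NEW_VERSION/tools
```

Any differences printed by the helper are the lines to merge by hand into your edited config.xt4.pl.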