TORQUE Resource Manager

TORQUE Administrator's Manual - 4.3 Server High Availability

4.3 Server High Availability

TORQUE can run in a redundant, or high availability, mode. Multiple instances of the server can be running at once, with the extra instances waiting to take over processing if the currently active server fails.

Note that the high availability feature is available in the 2.3 and 2.4 (trunk) versions of TORQUE.

Multiple server host machines

Two server host machines can run pbs_server at the same time. Both servers must have their torque/server_priv directory mounted on a shared NFS file system. Each pbs_server needs to be started with the --ha command line option, which allows two servers to run at the same time. Only the first server to start will complete the full startup. The second server to start will block very early in its startup, when it tries to lock the file torque/server_priv/server.lock. When the second server cannot obtain the lock, it spins in a loop and waits for the lock to clear, sleeping one second between checks of the lock file.
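
The wait loop described above can be sketched in shell. This is only an illustration of the polling pattern, not the pbs_server implementation: try_lock and wait_for_lock are hypothetical names, and atomic directory creation is used here as a stand-in for whatever locking pbs_server actually performs on server.lock.

```shell
#!/bin/sh
# try_lock LOCK: hypothetical stand-in for the real lock call.
# mkdir is atomic, so only one caller can succeed for a given path.
try_lock() {
    mkdir "$1" 2>/dev/null
}

# wait_for_lock LOCK [INTERVAL]: poll until the lock is obtained,
# sleeping INTERVAL seconds (default 1) between attempts.
wait_for_lock() {
    lock=$1
    interval=${2:-1}
    until try_lock "$lock"; do
        sleep "$interval"
    done
}
```

The first caller returns from wait_for_lock immediately; any later caller stays in the loop until the lock path is removed, which mirrors how the second server only proceeds once the first one is gone.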

Note that the servers do not have to run on independent server hardware; multiple instances of pbs_server can also run on the same machine. This was not possible before, as the second instance to start would always write an error and quit when it could not obtain the lock.

How commands select the correct server host

The various commands that send messages to pbs_server usually accept a server name on the command line; if none is specified, they use the default server name. The default server name comes either from the environment variable PBS_DEFAULT or from the file torque/server_name.

The definition of the contents of the file torque/server_name has been extended so that it may contain a comma-separated list of server names.

When a command is executed and no explicit server is mentioned, an attempt will be made to connect to the first server name in the list. If this fails, then the second server name will be tried. If both servers are unreachable, an error is returned and the command fails.
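
The selection order described above can be sketched as a small shell function. This is not how the TORQUE client commands are implemented; first_reachable and the probe argument are hypothetical names introduced for illustration, and the probe is whatever command you choose to decide that a server is reachable.

```shell
#!/bin/sh
# first_reachable LIST PROBE: print the first name in the comma-separated
# LIST for which the command PROBE succeeds; return 1 if none does.
first_reachable() {
    # Drop stray whitespace such as a trailing blank or newline from the file.
    list=$(printf '%s' "$1" | tr -d ' \t\n')
    probe=$2
    old_ifs=$IFS
    IFS=,
    set -- $list          # split the list on commas
    IFS=$old_ifs
    for name in "$@"; do
        if "$probe" "$name"; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}
```

One possible probe, assuming the client commands are installed, is a function that runs qstat -B against the given name and discards the output. The list itself would come from PBS_DEFAULT or from the contents of torque/server_name.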

Note that after the current server fails, there is a period of time while the new server is starting up during which it is unable to process commands. The new server must read the existing configuration and job information from disk, so the length of this period depends on the size of the disk-based state information. Commands issued during this period might fail because their timeouts expire.

Job names

One aspect of this enhancement is in the construction of job names. Job names normally contain the name of the host machine where pbs_server is running. Now when job names are constructed, only the first name from the server specification list is used in building the job name.
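
As a small illustration of this rule, the sketch below builds an identifier from a sequence number and the first name in the list. It assumes the usual sequence_number.server_name form; job_id is a hypothetical helper, not a TORQUE command.

```shell
#!/bin/sh
# job_id SEQ LIST: print SEQ followed by the first name in the
# comma-separated LIST (everything from the first comma on is dropped).
job_id() {
    printf '%s.%s\n' "$1" "${2%%,*}"
}
```

With the server_name contents used later in this section, job_id 42 'node12.cridomain,node13.cridomain' yields 42.node12.cridomain regardless of which of the two servers is currently active.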

Persistence of the pbs_server process

The system administrator must ensure that pbs_server continues to run on the server nodes. This could be as simple as a cron job that counts the number of pbs_server processes in the process table and starts more if needed.
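
A minimal sketch of such a check follows. It assumes pgrep is available and that pbs_server is on the PATH of the user running it; server_running and ensure_server are hypothetical names, and the CHECK_CMD and START_CMD variables exist only so the logic can be exercised without a real server.

```shell
#!/bin/sh
# server_running: succeed if at least one pbs_server process exists locally.
server_running() {
    pgrep -x pbs_server >/dev/null 2>&1
}

# ensure_server: start pbs_server in high availability mode if the check
# fails. CHECK_CMD and START_CMD are illustrative overrides for testing.
ensure_server() {
    if ! "${CHECK_CMD:-server_running}"; then
        ${START_CMD:-pbs_server --ha}
    fi
}
```

A cron entry could source a file containing these functions and call ensure_server at whatever interval suits the site. Because a second instance started with --ha simply waits on the lock, starting one when none is running is harmless even if the other host is currently the active server.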

High availability of the NFS server

One consideration of this implementation is that it depends on the NFS file system also being redundant. NFS can be set up as a redundant service. See the following.

There are also other ways to set up a shared file system. See the following.

Example configuration

This section describes the test setup used to verify the operation of this new feature. Three machines were used: my desktop machine, jakaa, and two lab machines, node12 and node13, where the pbs_server instances were resident. Commands were submitted from my desktop machine, jakaa. This machine also ran pbs_mom and pbs_sched, and it was the NFS server for the shared server_priv directory.

The NFS setup on jakaa is shown below. It exports the entire torque directory structure, although the server machines only mounted the server_priv part of the share.

NFS exports
root@jakaa:/etc# cat exports 
# /etc/exports: the access control list for filesystems which may be exported 
#               to NFS clients.  See exports(5). 
# 
# Example for NFSv2 and NFSv3: 
# /srv/homes       hostname1(rw,sync) hostname2(ro,sync) 
# 
# Example for NFSv4: 
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt) 
# /srv/nfs4/homes  gss/krb5i(rw,sync) 
# 
/tmp            *.cridomain(rw,sync) 
/var/spool/torque               *.cridomain(rw,sync,no_root_squash) 

Next is shown the fstab file that describes how the mounts were done on node12 and node13.

FSTAB file contents
root@node12:/var/spool/torque/mom_priv# cat /etc/fstab 
# /etc/fstab: static file system information. 
# 
# <file system> <mount point>   <type>  <options>       <dump>  <pass> 
proc            /proc           proc    defaults        0       0 
/dev/hda1       /               ext3    defaults,errors=remount-ro 0       1 
/dev/hda5       none            swap    sw              0       0 
jakaa:/var/spool/torque/server_priv      /var/spool/torque/server_priv        nfs     bg,intr,soft,rw        0       0 

Nodes 12 and 13 each had their own /var/spool/torque directories. The NFS mount just replaced the server_priv subdirectory with the shared one on jakaa. As far as the local configuration on nodes 12 and 13 is concerned, the only thing done was to set up the server_name file.

Contents of the torque directory
root@node12:/var/spool/torque# ll 
total 56 
drwxr-xr-x 12 root root 4096 2007-11-28 15:40 ./ 
drwxr-xr-x  4 root root 4096 2007-12-05 10:14 ../ 
drwxr-xr-x  2 root root 4096 2007-12-04 15:24 aux/ 
drwx------  2 root root 4096 2007-11-28 15:40 checkpoint/ 
drwxr-xr-x  2 root root 4096 2007-12-06 00:01 mom_logs/ 
drwxr-x--x  3 root root 4096 2007-11-30 11:22 mom_priv/ 
-rw-r--r--  1 root root   36 2007-11-28 15:40 pbs_environment 
drwxr-xr-x  2 root root 4096 2007-12-04 10:26 sched_logs/ 
drwxr-x---  3 root root 4096 2007-12-07 10:08 sched_priv/ 
drwxr-xr-x  2 root root 4096 2007-12-07 00:07 server_logs/ 
-rw-r--r--  1 root root   34 2007-12-06 11:33 server_name 
drwxr-x--- 12 root root 4096 2007-12-07 11:38 server_priv/ 
drwxrwxrwt  2 root root 4096 2007-12-05 15:40 spool/ 
drwxrwxrwt  2 root root 4096 2007-12-04 12:47 undelivered/ 

root@node12:/var/spool/torque# cat server_name 
node12.cridomain,node13.cridomain 

File torque/server_name
root@jakaa:/var/spool/torque# cat server_name 
node12,node13 

The following shows the setup of the batch queue.

PBS server configuration
root@node12:/var/spool/torque# qmgr -c 'p s' 
# 
# Create queues and set their attributes. 
# 
# 
# Create and define queue batch 
# 
create queue batch 
set queue batch queue_type = Execution 
set queue batch resources_default.nodes = 1 
set queue batch resources_default.walltime = 01:00:00 
set queue batch resources_available.nodect = 999999 
set queue batch enabled = True 
set queue batch started = True 
# 
# Set server attributes. 
# 
set server scheduling = True 
set server acl_host_enable = True 
set server acl_hosts = jakaa.cridomain 
set server acl_hosts += node13.cridomain 
set server acl_hosts += node12.cridomain 
set server acl_hosts += jakaa 
set server managers = root@node12.cridomain 
set server managers += ssnelgrove@jakaa.cridomain 
set server managers += ssnelgrove@node12.cridomain 
set server operators = root@node12.cridomain 
set server operators += ssnelgrove@node12.cridomain 
set server default_queue = batch 
set server log_events = 511 
set server mail_from = adm 
set server scheduler_iteration = 600 
set server node_check_rate = 150 
set server tcp_timeout = 6 
set server mom_job_sync = True 
set server pbs_version = 2.2.2 
set server keep_completed = 300 
set server submit_hosts = jakaa 

Note that in this setup, we had to explicitly add node12 and node13 to the acl_hosts list. This requirement has since been removed: the code now automatically adds the names in the server_name file to the acl_hosts list.

Next are shown the processes running on all of the machines.

Running instances of PBS processes
root@jakaa:/var/spool/torque# ps -ef|grep pbs 
root     11548     1  0 11:34 ?        00:00:00 pbs_mom 
root     19305     1  0 11:42 ?        00:00:00 pbs_sched 

root@node12:/var/spool/torque# ps -ef|grep pbs 
root      4992     1  0 11:38 ?        00:00:00 pbs_server --ha 

root@node13:/var/spool/torque# ps -ef|grep pbs 
root     15190     1  0 13:11 pts/0    00:00:00 pbs_server --ha