[torqueusers] Torque-HA resource manager integration

Josh Butikofer josh at clusterresources.com
Wed Jul 30 08:09:59 MDT 2008


Michael and Michael,

The configuration that Michael Robbert gave below will work for communicating with TORQUE in HA
mode, but not for the reasons that Mr. Robbert assumed.

The TORQUE libraries have access to the server_name file found in TORQUE's configuration/spool
directory. This file lists the primary and secondary TORQUE servers. This means Moab can simply
call the TORQUE libraries, and the libraries will work out which server they should communicate
with (depending on which one is currently running with an open socket). Moab does not even have to
know that TORQUE is running in HA mode; it should just work.
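
To illustrate the idea, here is a rough sketch in Python. It is not TORQUE's actual library code,
only a model of the behaviour described above. It assumes that server_name holds a comma-separated
list of hosts (optionally host:port) and that pbs_server listens on port 15001 unless told
otherwise. The function names are hypothetical.

```python
import socket

DEFAULT_PBS_PORT = 15001  # assumption: the usual default pbs_server port


def parse_server_name(text):
    """Split the contents of a server_name file into (host, port) pairs."""
    servers = []
    for entry in text.strip().split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        servers.append((host, int(port) if port else DEFAULT_PBS_PORT))
    return servers


def tcp_probe(host, port, timeout=2.0):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_active_server(servers, probe=tcp_probe):
    """Return the first (host, port) the probe reports as reachable, else None."""
    for host, port in servers:
        if probe(host, port):
            return (host, port)
    return None
```

A caller would read server_name, pass its contents to parse_server_name(), and hand the result to
pick_active_server(). The point is that the choice of server happens below Moab, which is why
Moab's own configuration needs no knowledge of the TORQUE HA pair.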

In other words, you should only need a single RMCFG[] line configuring a single TORQUE RM, as is 
shown in Mr. Robbert's moab.cfg file. The RMCFG[] lines do indeed control both submission and data 
querying from TORQUE (and other resource managers).

As for the SCHEDCFG[] line, its parameters only affect the Moab scheduler configuration. The
additional FBSERVER=s02.local:42559 tells Moab that a secondary Moab Workload Manager daemon is
running on s02.local, port 42559. This setting only controls Moab's own HA, not TORQUE HA.

Hopefully that makes sense.

 >> From Michael Sternberg:
 >> I also have Linux-HA working on the torque-HA node pair, and could
 >> provide a shared IP for the scheduler to talk to.  However, as
 >> Linux-HA and pbs_server use different time constants and mechanisms to
 >> trigger failover, this can only lead to a mess when the service
 >> locations are incoherent.

If you want to use Linux-HA (which I think is a fine idea), you should probably disable TORQUE's HA 
mechanism. As you mentioned, using them both together is messy. I would only use one or the other.

Regards,

Josh Butikofer



Michael Robbert wrote:
> Michael,
> We have a similar setup. Here are the lines that we have in our Moab 
> config that appear to be relevant.
> 
> SCHEDCFG[clustername] SERVER=s01.domain.edu:42559 FBSERVER=s02.local:42559 MODE=NORMAL
> 
> RMCFG[base]             TYPE=PBS
> RMCFG[base]             SUBMITCMD=/opt/torque/bin/qsub
> 
> So, it looks like RMCFG is only used to submit jobs and SCHEDCFG is used 
> to get data back from Torque.
> 
> Good luck,
> Mike Robbert, Colorado School of Mines
> 

