[torqueusers] Torque-HA resource manager integration
Michael Robbert
mrobbert at mines.edu
Fri Jul 25 10:55:26 MDT 2008
Michael,
We have a similar setup. Here are the lines that we have in our Moab
config that appear to be relevant.
SCHEDCFG[clustername] SERVER=s01.domain.edu:42559 FBSERVER=s02.local:42559 MODE=NORMAL
RMCFG[base] TYPE=PBS
RMCFG[base] SUBMITCMD=/opt/torque/bin/qsub
So, it looks like RMCFG is only used to submit jobs and SCHEDCFG is used
to get data back from Torque.
Good luck,
Mike Robbert, Colorado School of Mines
Michael Sternberg wrote:
> How do I tell Moab or Maui how to schedule for an HA-Torque server pair?
>
>
> I use the following versions:
> torque-2.3.2-snap.200807092141
> moab server version 5.2.1 (revision 9490)
>
>
> Here's where I am: I've set up a torque-HA pair with server_priv/
> shared on an HA-NFS mount, following:
>
> http://www.clusterresources.com/wiki/doku.php?id=torque:4.3_server_high_availability
>
>
>
> Both pbs_server processes run, both have the shared lock file open,
> but only one at a time has TCP ports open, as it should:
>
> lsof -p `ps -ef | awk '/[p]bs_server/ {print $2}'`
>
> Killing either server makes the other reliably take over the ports and
> provide service. On a client, "qstat -a" shows the currently active
> server in the header, but always with s01 in the job spec, as
> designed. So far, so good.
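
The check described above (only one server of the pair has its TCP ports open at a time) can also be made from a client. The following is a minimal sketch, not a Torque tool: it assumes pbs_server listens on Torque's default port 15001 and that the short host names s01 and s02 resolve from where it runs. The function names are made up for illustration.

```python
import socket

# Assumption (not stated in the post): pbs_server uses Torque's default
# port 15001. Adjust if the site runs it on a different port.
PBS_SERVER_PORT = 15001


def accepts_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def active_servers(hosts, port=PBS_SERVER_PORT, probe=accepts_tcp):
    """Return the hosts that currently accept connections on the given port."""
    return [h for h in hosts if probe(h, port)]


# Example call (hypothetical host names taken from the post):
#   active_servers(["s01", "s02"])
```

With a healthy HA pair the returned list should contain exactly one host. An empty list or two entries would suggest the pair is not in the state described above.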
>
>
> Now, what do I have to do on the moab/maui side? I am thinking along
> the lines of using the same or different names in "RMCFG[name]", and
> likewise, for EPORT. Is this the right place to look? I've tried in
> moab.cfg:
>
>
> (1) Same RM names, same port:
>
> RMCFG[baseha] TYPE=PBS HOST=s01 EPORT=15008
> RMCFG[baseha] TYPE=PBS HOST=s02 EPORT=15008
>
> Fails - no scheduling; every 30s in moab/stats/events.* :
>
> 09:36:55 1216996615 rm baseha RMDOWN cannot connect to RM
>
> I take it the second line overrides the first, and s02 happened to be
> the standby at the time, so no go.
>
>
> (2) One entry only, joined host names:
>
> RMCFG[baseha] TYPE=PBS HOST=s01,s02 EPORT=15008
>
> Fails - no scheduling; RMDOWN events every 30s.
>
> 09:39:59 1216996799 rm baseha RMDOWN cannot connect to RM
>
> OK, I guess that's a syntax error then. Same with "+" and ":" as a
> separator. (Inspired by lustre.)
>
>
> (3) Same RM name, different ports:
>
> RMCFG[baseha] TYPE=PBS HOST=s01 EPORT=15008
> RMCFG[baseha] TYPE=PBS HOST=s02 EPORT=15009
>
> Scheduling works only when s02 is active. Does not work when s01
> takes over:
>
> 09:50:34 1216997434 rm baseha RMDOWN cannot connect to RM
>
> Again, probably means the last definition for "RMCFG[]" wins.
>
>
> (4) *Different* RM names, same port:
>
> RMCFG[baseha1] TYPE=PBS HOST=s01 EPORT=15008
> RMCFG[baseha2] TYPE=PBS HOST=s02 EPORT=15008
>
> Scheduling works when either s01 or s02 is active. However, the
> standby RM is always reported as down.
>
> 09:54:00 1216997640 rm baseha2 RMDOWN cannot connect to RM
> ...
> 09:59:59 1216997999 rm baseha1 RMDOWN cannot connect to RM
>
>
> So, number (4) seems to work, but:
>
> (a) Is it safe?
>
> (b) Is it robust (e.g. during RM server failovers)?
>
> (c) The "RMDOWN" events for the standby RM will drown out critical
> failures, for example in event notifications:
>
> http://www.clusterresources.com/products/mwm/moabdocs/14.4eventmgmt.shtml
>
>
> Can I squelch exactly the log entries relating to the standby server?
>
> I looked at the following docs, but found nothing specific:
>
> http://www.clusterresources.com/products/mwm/moabdocs/14.2logging.shtml#eventformat
>
> http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#recordeventlist
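
Whether Moab itself can suppress these entries is not answered here. As a workaround outside Moab, the events file could be post-processed before it is read or fed to a notifier. The sketch below is only that: it assumes each event occupies one line with whitespace-separated fields in the order seen in the samples above (time, epoch, object type, object name, event, message), and the function names are hypothetical.

```python
def parse_event(line):
    """Split an event line into (time, epoch, objtype, objname, event, message).

    Returns None for lines with fewer than five fields.
    """
    parts = line.split(None, 5)
    if len(parts) < 5:
        return None
    # Pad a missing message field so the tuple always has six entries.
    parts += [""] * (6 - len(parts))
    return tuple(parts)


def drop_standby_rmdown(lines, standby_rm):
    """Yield event lines, skipping RMDOWN events for the RM named standby_rm."""
    for line in lines:
        ev = parse_event(line)
        if ev and ev[2] == "rm" and ev[3] == standby_rm and ev[4] == "RMDOWN":
            continue
        yield line


# Example with the two entries from case (4), treating baseha2 as the standby:
#   events = [
#       "09:54:00 1216997640 rm baseha2 RMDOWN cannot connect to RM",
#       "09:59:59 1216997999 rm baseha1 RMDOWN cannot connect to RM",
#   ]
#   list(drop_standby_rmdown(events, "baseha2"))
```

The name of the standby RM has to be supplied by the caller, since it changes on every failover. One way to determine it is to check which host currently has the pbs_server port open.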
>
>
>
> I also have Linux-HA working on the torque-HA node pair, and could
> provide a shared IP for the scheduler to talk to. However, as
> Linux-HA and pbs_server use different time constants and mechanisms to
> trigger failover, this can only lead to a mess when the service
> locations are incoherent.
>
>
> Regards, Michael