[torqueusers] Server not talking to MOMs at all
Stewart.Samuels at sanofi-aventis.com
Wed Sep 7 10:09:17 MDT 2005
Sorry for the retransmission of this message but I think my response got appended to Troy's response to Garrick.
Troy, to answer your question: in my case, I would like at least two servers in an active/active HA mode that users can reach through a load-balancing mechanism, to provide a robust basis for a virtualized computing environment. That is, the user interfaces with a single hostname representing a server. The user does not need to know that the server is really a cluster, nor what types of nodes the cluster contains, only that it is a compute resource capable of executing their jobs. At the same time, as the cluster grows in size (and expense), I clearly want this system to be available to the users as much as possible and to maintain state. Obviously, one head node is a single point of failure for the entire system, so multiple head nodes help to alleviate this.
I realize there are several ways of failing over systems. But without multi-server support, how do the servers retain the state of the system (running and queued jobs) when head nodes are failed over using active/passive methods? These are some of the questions I am currently investigating and would welcome any solutions the community has successfully implemented.
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Troy Baer
Sent: Wednesday, September 07, 2005 10:01 AM
To: Garrick Staples
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] Server not talking to MOMs at all
On Sat, 2005-09-03 at 15:31 -0700, Garrick Staples wrote:
> On Sat, Sep 03, 2005 at 04:22:00PM -0600, Dave Jackson alleged:
> > Garrick,
> > > > > The first $clienthost listed identifies the "server" to the MOM. It is the
> > > > > only hostname that will receive status updates from the MOM.
> > > >
> > > > I would argue that this behavior is somewhere between counter-intuitive
> > > > and broken, even if it has been in PBS since the beginning of time. :)
> > We would like to add a 'synonym' parameter to $clienthost called
> > '$headnode'. It would behave exactly like $clienthost except for the
> > 'confusing both old and new alike' part.
> > Is this the best name? Thoughts?
> I just noticed that '$pbsserver' is already a synonym for '$clienthost'.
> I don't know how long it's been there, but it looks "post OpenPBS" to
> me. I suppose that is as suitable as '$headnode'; and someone,
> somewhere, already has '$pbsserver' in their config files.
The problem IMHO is that the pbs_mom code has a single array called
pbs_servername that appears to be an ACL of hosts allowed to talk to
the pbs_mom daemon, including both the server(s) *AND* all the other
moms without differentiating between the two (except that the very first
one is special). I would argue that the correct solution to this is to
add a second ACL called pbs_clientname that's the list of moms the
local mom daemon is allowed to talk to. That requires $clienthost and
$pbsserver to do different things when the mom config file is parsed,
but it makes multi-server easier insofar as it's now kosher for pbs_mom
to send utilization data to every host listed in pbs_servername. (I
would also argue that the contents of $PBS_HOME/server_name needs to be
BTW, the $headnode terminology IMHO doesn't make sense in an environment
where the system running pbs_server (and likely pbs_sched/maui/moab as
well) is *NOT* a login node, although I'm not sure if that describes
anyone's system structure other than ours. In any case, I would argue
that $pbsserver (or alternately $serverhost) is much more descriptive.
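As a rough illustration of the two-ACL split proposed above, here is a minimal Python sketch. It is not pbs_mom code (which is C), and it assumes the proposed semantics rather than the current ones: $pbsserver lines fill pbs_servername, $clienthost lines fill pbs_clientname, and status updates go to every host in pbs_servername. The names MomAcl, parse_mom_config, is_allowed and status_update_targets, the "#" comment handling, and the example hostnames are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MomAcl:
    # Hosts running pbs_server; under the proposal, all of them get status updates.
    pbs_servername: List[str] = field(default_factory=list)
    # Other hosts (for example sister moms) that may talk to this mom.
    pbs_clientname: List[str] = field(default_factory=list)


def parse_mom_config(text: str) -> MomAcl:
    """Parse a mom config, treating $pbsserver and $clienthost differently."""
    acl = MomAcl()
    for raw in text.splitlines():
        # Assumption: '#' starts a comment; blank lines are skipped.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) < 2:
            continue
        key, host = parts[0], parts[1]
        if key == "$pbsserver" and host not in acl.pbs_servername:
            acl.pbs_servername.append(host)
        elif key == "$clienthost" and host not in acl.pbs_clientname:
            acl.pbs_clientname.append(host)
    return acl


def is_allowed(acl: MomAcl, host: str) -> bool:
    """A host may talk to the mom if it appears in either ACL."""
    return host in acl.pbs_servername or host in acl.pbs_clientname


def status_update_targets(acl: MomAcl) -> List[str]:
    """Only servers receive utilization data, but all of them do."""
    return list(acl.pbs_servername)


if __name__ == "__main__":
    sample = """
    $pbsserver  head1.example.org   # hypothetical server host
    $pbsserver  head2.example.org   # hypothetical second server
    $clienthost node01.example.org  # hypothetical sister mom
    """
    acl = parse_mom_config(sample)
    print(status_update_targets(acl))
```

With this split, no position in either list is special: adding a second server is just another $pbsserver line.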
> Multi-server support still confuses me. I'm still really unsure what
> precise behaviour people want.
Same here. I don't see how you can make multi-server support work
without a shared filesystem (with working locks, i.e. not NFS) for
$PBS_HOME and the addition of a whole bunch of cluster membership and
voting code to pbs_server. It seems like overkill if what we're really
looking for is simply a high-availability pbs_server.
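To make the "working locks" requirement concrete, here is a minimal Python sketch of how two candidate server hosts sharing one $PBS_HOME might decide which of them is active, using an exclusive advisory lock on a file in the shared directory. This is only an illustration of the idea, not how pbs_server works; the function name try_become_active and the lock file name are hypothetical, and whether such a lock is honored across hosts depends entirely on the filesystem, which is exactly the concern raised above.

```python
import fcntl
import os
from typing import Optional


def try_become_active(lock_path: str) -> Optional[int]:
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file descriptor if this process now holds the lock
    (and should act as the active server), or None if another holder exists.
    The lock is released when the descriptor is closed or the process dies.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Someone else already holds the lock: stay passive.
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    # Hypothetical lock file; a real setup would place it under the shared $PBS_HOME.
    fd = try_become_active("/tmp/pbs_server.lock")
    print("active" if fd is not None else "passive")
```

A lock like this only covers the "who is active" question for an active/passive pair; the membership and voting logic an active/active design would need is a much larger piece of work.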
What exactly are people looking for WRT multi-server support in TORQUE,
Troy Baer troy at osc.edu
Science & Technology Support http://www.osc.edu/hpc/
Ohio Supercomputer Center 614-292-9701