[torqueusers] Server not talking to MOMs at all

Stewart.Samuels at sanofi-aventis.com
Wed Sep 7 09:44:44 MDT 2005



-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org]On Behalf Of Troy Baer
Sent: Wednesday, September 07, 2005 10:01 AM
To: Garrick Staples
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] Server not talking to MOMs at all


On Sat, 2005-09-03 at 15:31 -0700, Garrick Staples wrote:
> On Sat, Sep 03, 2005 at 04:22:00PM -0600, Dave Jackson alleged:
> > Garrick,
> > > > The first $clienthost listed identifies the "server" to the MOM.  It is the
> > > > only hostname that will receive status updates from the MOM.
> > > > 
> > > > I would argue that this behavior is somewhere between counter-intuitive
> > > > and broken, even if it has been in PBS since the beginning of time. :)
> >  
> >   We would like to add a synonym parameter to $clienthost called
> > '$headnode'.  It would behave exactly like $clienthost except for the
> > 'confusing both old and new alike' part.
> > 
> >   Is this the best name?  Thoughts?
> 
> I just noticed that '$pbsserver' is already a synonym for '$clienthost'.
> I don't know how long it's been there, but it looks "post OpenPBS" to
> me.  I suppose that is as suitable as '$headnode'; and someone,
> somewhere, already has '$pbsserver' in their config files.

The problem IMHO is that the pbs_mom code has a single array called
pbs_servername[] that appears to be an ACL of hosts allowed to talk to
the pbs_mom daemon, including both the server(s) *AND* all the other
moms without differentiating between the two (except that the very first
one is special).  I would argue that the correct solution to this is to
add a second ACL called pbs_clientname[] that's the list of moms the
local mom daemon is allowed to talk to.  That requires $clienthost and
$pbsserver to do different things when the mom config file is parsed,
but it makes multi-server easier insofar as it's now kosher for pbs_mom
to send utilization data to every host listed in pbs_servername[].  (I
would also argue that the contents of $PBS_HOME/server_name needs to be
pbs_servername[0].)
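
To make the two-ACL idea concrete, here is a minimal sketch in Python.  It
is not the pbs_mom code (which is C) and not how TORQUE parses its config
today; it only illustrates $pbsserver and $clienthost filling separate
lists.  The function names and the hostnames in the usage note are made up
for illustration, pbs_clientname is the proposed (not existing) array, and
treating '#' as a comment marker is an assumption.

```python
# Hypothetical sketch of the proposed split; NOT the real pbs_mom parser.
def parse_mom_config(text):
    """Return (pbs_servername, pbs_clientname) from mom config text."""
    pbs_servername = []  # hosts running pbs_server; entry 0 is the primary
    pbs_clientname = []  # other moms the local mom is allowed to talk to
    for raw in text.splitlines():
        # Assumption: '#' starts a comment that runs to end of line.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line or directive without a hostname
        directive, host = parts[0], parts[1]
        if directive == "$pbsserver" and host not in pbs_servername:
            pbs_servername.append(host)
        elif directive == "$clienthost" and host not in pbs_clientname:
            pbs_clientname.append(host)
    return pbs_servername, pbs_clientname


def is_allowed(host, pbs_servername, pbs_clientname):
    """A host may talk to the mom if it appears in either ACL."""
    return host in pbs_servername or host in pbs_clientname


def status_targets(pbs_servername):
    """With separate ACLs, every listed server can receive status updates."""
    return list(pbs_servername)
```

Fed a config containing "$pbsserver srv1", "$pbsserver srv2" and
"$clienthost node01", this yields two servers (srv1 first) and one client,
and status_targets() returns both servers rather than only the first host
listed.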

BTW, the $headnode terminology IMHO doesn't make sense in an environment
where the system running pbs_server (and likely pbs_sched/maui/moab as
well) is *NOT* a login node, although I'm not sure if that describes
anyone's system structure other than ours.  In any case, I would argue
that $pbsserver (or alternately $serverhost) is much more descriptive.

> Multi-server support still confuses me.  I'm still really unsure what
> precise behaviour people want.

Same here.  I don't see how you can make multi-server support work
without a shared filesystem (with working locks, i.e. not NFS) for
$PBS_HOME and the addition of a whole bunch of cluster membership and
voting code to pbs_server.  It seems like overkill if what we're really
looking for is simply a high-availability pbs_server.

What exactly are people looking for WRT multi-server support in TORQUE,
anyway?

In my case, I would like at least two servers in an active/active HA mode that can be accessed by the user through a load balancing mechanism to provide a robust basis for a virtualized computing environment.  That is, the user interfaces with a single hostname representing a server.  The user does not need to know that the server is really a cluster, only that it is a compute resource capable of executing their jobs.  At the same time, as the cluster grows in size (and expense), I clearly want this system to be available to the users as much as possible and to maintain state.  Obviously, a single head node is a single point of failure for the system, so multiple head nodes help to alleviate this.

I realize there are several ways of failing over systems.  But without multi-server support, how do the servers retain the state of the system (running and queued jobs) when head nodes are failed over using active/passive methods?  These are some of the questions I am currently investigating and would welcome any solutions the community has successfully implemented.

	Stewart

