[torqueusers] Server not talking to MOMs at all

Troy Baer troy at osc.edu
Wed Sep 7 08:01:18 MDT 2005


On Sat, 2005-09-03 at 15:31 -0700, Garrick Staples wrote:
> On Sat, Sep 03, 2005 at 04:22:00PM -0600, Dave Jackson alleged:
> > Garrick,
> > > > The first $clienthost listed identifies the "server" to the MOM.
> > > > It is the only hostname that will receive status updates from the
> > > > MOM.
> > > > 
> > > > I would argue that this behavior is somewhere between counter-intuitive
> > > > and broken, even if it has been in PBS since the beginning of time. :)
> >  
> >   We would like to add a 'synonym' parameter to $clienthost called
> > '$headnode'.  It would behave exactly like $clienthost except for the
> > 'confusing both old and new alike' part.
> > 
> >   Is this the best name?  Thoughts?
> 
> I just noticed that '$pbsserver' is already a synonym for '$clienthost'.
> I don't know how long it's been there, but it looks "post OpenPBS" to
> me.  I suppose that is as suitable as '$headnode'; and someone,
> somewhere, already has '$pbsserver' in their config files.

The problem IMHO is that the pbs_mom code has a single array called
pbs_servername[] that appears to be an ACL of hosts allowed to talk to
the pbs_mom daemon, including both the server(s) *AND* all the other
moms without differentiating between the two (except that the very first
one is special).  I would argue that the correct solution to this is to
add a second ACL called pbs_clientname[] that's the list of moms the
local mom daemon is allowed to talk to.  That requires $clienthost and
$pbsserver to do different things when the mom config file is parsed,
but it makes multi-server easier insofar as it's now kosher for pbs_mom
to send utilization data to every host listed in pbs_servername[].  (I
would also argue that the contents of $PBS_HOME/server_name need to be
pbs_servername[0].)
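
To make the proposed split concrete, here is a minimal sketch in Python
rather than the C of pbs_mom.  It is not TORQUE code: the function names,
the simplified "$directive value" parsing, and the example.org hostnames
are all assumptions made up for illustration.  It only shows the idea
that $pbsserver entries would feed one list (servers, which receive
status updates) and $clienthost entries another (peers that are merely
allowed to connect).

```python
# Hypothetical sketch of the two-ACL proposal; NOT actual pbs_mom code.
# All names and hostnames here are invented for illustration.

EXAMPLE_CONFIG = """\
# mom config fragment (hostnames are made up)
$pbsserver   server1.example.org
$pbsserver   server2.example.org
$clienthost  node01.example.org
$clienthost  node02.example.org
"""


def parse_mom_config(text):
    """Return (servers, clients) parsed from simplified config text."""
    servers = []  # stands in for pbs_servername[]: hosts that get status updates
    clients = []  # stands in for the proposed pbs_clientname[]: peer moms
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # this sketch only understands "$directive host" lines
        directive, host = parts
        if directive == "$pbsserver" and host not in servers:
            servers.append(host)
        elif directive == "$clienthost" and host not in clients:
            clients.append(host)
    return servers, clients


def may_connect(host, servers, clients):
    """A host may talk to the mom if it appears in either list."""
    return host in servers or host in clients


def status_targets(servers):
    """Status updates go to every server, not only the first entry."""
    return list(servers)
```

With the fragment above, both servers would receive status updates, the
two nodes could connect but would receive none, and any other host would
be refused.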

BTW, the $headnode terminology IMHO doesn't make sense in an environment
where the system running pbs_server (and likely pbs_sched/maui/moab as
well) is *NOT* a login node, although I'm not sure if that describes
anyone's system structure other than ours.  In any case, I would argue
that $pbsserver (or alternately $serverhost) is much more descriptive.

> Multi-server support still confuses me.  I'm still really unsure what
> precise behaviour people want.

Same here.  I don't see how you can make multi-server support work
without a shared filesystem (with working locks, i.e. not NFS) for
$PBS_HOME and the addition of a whole bunch of cluster membership and
voting code to pbs_server.  It seems like overkill if what we're really
looking for is simply a high-availability pbs_server.

What exactly are people looking for WRT multi-server support in TORQUE,
anyway?

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701


