[torqueusers] Server not talking to MOMs at all

Garrick Staples garrick at usc.edu
Thu Sep 1 15:16:43 MDT 2005


On Thu, Sep 01, 2005 at 04:50:23PM -0400, Troy Baer alleged:
> On Mon, 2005-08-15 at 16:14 -0700, Garrick Staples wrote:
> > The first $clienthost listed identifies the "server" to the MOM.  It is the
> > only hostname that will receive status updates from the MOM.
> 
> I would argue that this behavior is somewhere between counter-intuitive
> and broken, even if it has been in PBS since the beginning of time. :)
 
Agreed.  The word "client" in that parameter confused me in the
beginning; I tend to think of pbs_server as the "server" and MOMs as
"clients".


> It seems to me that the most expeditious solution to this would be to
> make pbs_mom behave in a manner symmetric with pbs_server and the client
> programs, i.e. use $PBS_DEFAULT as the server host[:port] if it's set,
> or the contents of $PBS_HOME/server_name if it's not.  Then you can use
> your favorite failover or virtualization scheme to move that IP address
> between hosts for high availability purposes.
 
But there is nothing preventing you from moving the server's IP right
now.  

And some people want multiple servers at the same time, which your
solution would prevent.  Perhaps that implies a $serverhost config (the
basl scheduler has this).
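
To make the proposed lookup order concrete, here is a minimal sketch in
Python. It is not pbs_mom's actual code. It only illustrates "use
$PBS_DEFAULT if set, otherwise $PBS_HOME/server_name". The function name
resolve_server, the /var/spool/torque default for PBS_HOME, and the
fallback port 15001 are assumptions made for the example.

```python
import os
from pathlib import Path

DEFAULT_PORT = 15001  # assumed pbs_server port when none is given


def resolve_server(env=None, pbs_home="/var/spool/torque"):
    """Return (host, port) following the proposed lookup order.

    pbs_home is an assumed default location; adjust for your install.
    """
    env = os.environ if env is None else env

    # 1. Prefer $PBS_DEFAULT, in host[:port] form, if it is set.
    spec = env.get("PBS_DEFAULT", "").strip()

    # 2. Otherwise fall back to the first line of $PBS_HOME/server_name.
    if not spec:
        lines = (Path(pbs_home) / "server_name").read_text().splitlines()
        spec = lines[0].strip() if lines else ""

    if not spec:
        raise ValueError("no server configured")

    host, sep, port = spec.partition(":")
    return host, (int(port) if sep else DEFAULT_PORT)
```

Note that this resolves exactly one server, which is the limitation
raised above: it has no way to express several servers at once.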


And there are other issues with multiple servers config'd in MOM:

If you intend to have 1 primary "hot" server, and 1 backup "cold"
server, then MOM will waste a whole lot of time talking to a server that
isn't running.

With a backup "hot" server, how does it get the primary's state?

I'd like to see pbs_server push more configs to MOMs.  How is that
handled with multiple servers?

If the idea is to have nodes in multiple clusters at the same time, how
do you enforce policies like "jobs per node"?


> I'm going to be out for the next few days, but I may try to crank out a
> patch for this when I get back next week.

Personally, I've been avoiding this issue, because every time I think
about multi-server support I get completely lost on the specifics.

I was thinking of talking to people at the SC05 BOF before making any
more code changes.  Perhaps the only sensible decisions are ones made in
the context of Moab.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California