[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6
Stewart.Samuels at sanofi-aventis.com
Thu Sep 15 12:15:32 MDT 2005
On Thu, 2005-09-15 at 14:34, Troy Baer wrote:
> On Thu, 2005-09-15 at 10:45 -0700, Garrick Staples wrote:
> > On Thu, Sep 15, 2005 at 01:10:56PM -0400, Troy Baer alleged:
> > > Do you have $clienthost entries in $PBS_HOME/mom_priv/config for all of
> > > your compute nodes? If not, I suspect that's your problem, as pbs_mom
> > > needs a $clienthost entry for every host that's allowed to talk to it,
> > > server and moms.
> > No you don't. You need entries for all non-node hosts. The first entry
> > must be your pbs_server host, and add additional entries for other hosts
> > if you want to be able to run 'momctl' or 'dumpmom'.
> That's not what the pbs_mom manpage says, FWIW:
> which causes a host name to be added to the list of hosts
> which will be allowed to connect to MOM as long as they
> are using a privilaged port. For example, here are two
> configuration file lines which will allow the hosts
> "fred" and "wilma" to connect:
> $clienthost fred
> $clienthost wilma
> Two host name are always allowed to connection to
> pbs_mom, "localhost" and the name returned to pbs_mom by
> the system call gethostname(). These names need not be
> specified in the configuration file. The hosts listed as
> "clienthosts" comprise a "sisterhood" of machines. Any
> one of the sisterhood will accept connections from a
> server from within the sisterhood. They will also accept
> Resource Monitor (RM) requests and Internal MOM (IM) mes-
> sages from within the sisterhood. For a sisterhood to be
> able to communicate IM messages to each other, they must
> all share the same RM port.
> > pbs_server propagates a list of all nodes to every node in your cluster.
> If this is the case (and my experiments with TORQUE just now indicate
> that it is, somewhat to my surprise), then the section of the pbs_mom
> man page cited above is wrong and needs to be corrected.
> I guess you learn something every day. :)
I recently looked through this section of the code, and Garrick's
assessment is correct. In addition, I believe there is a section in the
original PBS User Guide which describes how pbs_server passes the list
of nodes in the cluster to all members of that cluster, so the nodes do
not need to be listed in the mom_priv/config file (just as Garrick has
described). The paragraphs from the pbs_mom manpage that you quote are
taken from the PBS User Guide as well.
What is different, and makes things more confusing, is that if multiple
$clienthost entries are listed in the mom_priv/config file, only the
first $clienthost entry found in the file is used. (I can't find this
anywhere in the docs, but Garrick has mentioned it in other threads.)
This helps to reinforce that moms talk to one and only one pbs_server.
This did not used to be the case, but as we have discovered, the code
(but apparently not the documentation) has been modified to provide for
high cluster scalability.
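To make that concrete, here is a minimal sketch of what mom_priv/config
could look like on a compute node, assuming the behaviour described in
this thread. The host name "headnode" is made up for illustration; use
the name of your own pbs_server host.

```
# $PBS_HOME/mom_priv/config on each compute node
# "headnode" is a hypothetical name standing in for the pbs_server host
$clienthost headnode
# no $clienthost lines for the compute nodes themselves
```

If the above is right, the compute nodes need no entries of their own,
since pbs_server hands each mom the node list.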