[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6

Stewart Samuels Stewart.Samuels at sanofi-aventis.com
Thu Sep 15 12:15:32 MDT 2005

On Thu, 2005-09-15 at 14:34, Troy Baer wrote:
> On Thu, 2005-09-15 at 10:45 -0700, Garrick Staples wrote:
> > On Thu, Sep 15, 2005 at 01:10:56PM -0400, Troy Baer alleged:
> > > Do you have $clienthost entries in $PBS_HOME/mom_priv/config for all of
> > > your compute nodes?  If not, I suspect that's your problem, as pbs_mom
> > > needs a $clienthost entry for every host that's allowed to talk to it,
> > > server and moms.
> > 
> > No you don't.  You need entries for all non-node hosts.  The first entry
> > must be your pbs_server host, and add additional entries for other hosts
> > if you want to be able to run 'momctl' or 'dumpmom'.
> That's not what the pbs_mom manpage says, FWIW:
>     clienthost
>            which causes a host name to be added to the list of hosts
>            which will be allowed to connect to MOM as long  as  they
>            are  using  a privilaged port.  For example, here are two
>            configuration file  lines  which  will  allow  the  hosts
>            "fred" and "wilma" to connect:
>            $clienthost      fred
>            $clienthost      wilma
>            Two  host  name  are  always  allowed  to  connection  to
>            pbs_mom, "localhost" and the name returned to pbs_mom  by
>            the  system  call gethostname().  These names need not be
>            specified in the configuration file.  The hosts listed as
>            "clienthosts"  comprise  a "sisterhood" of machines.  Any
>            one of the sisterhood  will  accept  connections  from  a
>            server from within the sisterhood.  They will also accept
>            Resource Monitor (RM) requests and Internal MOM (IM) mes-
>            sages from within the sisterhood.  For a sisterhood to be
>            able to communicate IM messages to each other, they  must
>            all share the same RM port.
> > pbs_server propagates a list of all nodes to every node in your cluster.
> If this is the case (and my experiments with TORQUE just now indicate
> that it is, somewhat to my surprise), then the section of the pbs_mom
> man page cited above is wrong and needs to be corrected.
> I guess you learn something every day. :)
> 	--Troy

I have recently looked through this section of the code and Garrick's
assessment is correct.  Additionally, however, I believe there is a
section in the original PBS User Guide which describes the fact that
pbs_server actually passes the list of nodes in the cluster to all
members of that cluster, and therefore, they do not need inclusion in
the mom_priv/config file (just as Garrick has described).  The
paragraphs in the pbs_mom manpage that you mention are taken from
the PBS User Guide as well.

What is different and makes things more confusing is that (and I can't
find it anywhere in the docs, but Garrick has mentioned it in other
threads) if multiple $clienthost entries are listed in the
mom_priv/config file, only the first $clienthost entry found in the file
is used.  This helps to reinforce that moms talk to one and only one
pbs_server.  This did not use to be the case, but as we have discovered,
to provide for high cluster scalability, the code (but apparently not
the documentation) has been modified.
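
As a rough sketch of what that implies for $PBS_HOME/mom_priv/config on a
compute node: one $clienthost line naming the pbs_server host, and no
lines for the other compute nodes, since pbs_server passes them the node
list itself.  The host name below is made up for illustration, and I am
assuming your pbs_mom accepts '#' comment lines in this file.

    # hypothetical host name; substitute your own pbs_server host
    $clienthost      headnode.example.com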


