[torqueusers] Server not talking to MOMs at all
garrick at usc.edu
Mon Aug 15 15:49:21 MDT 2005
On Mon, Aug 15, 2005 at 05:11:19PM -0400, Prakash Velayutham alleged:
> Garrick Staples wrote:
> >On Mon, Aug 15, 2005 at 03:24:41PM -0400, Prakash Velayutham alleged:
> >>Here is the output of momctl -d 4 -h yy.yy.yy.yy (on the compute node):
> >Does this work from the server? Anything interesting in server's log
> Hi Garrick,
> This is what I get from the server
> Host: xylose/xylose.dmzcluster.cchmc.org Server: fructose Version:
> HomeDirectory: /var/spool/torque/mom_priv
> MOM active: 6223 seconds
> WARNING: no messages received from server
> Last Msg To Server: 20 seconds
> Server Update Interval: 20 seconds
> WARNING: no hello/cluster-addrs messages received from server
> Init Msgs Sent: 624 hellos
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: RPP
> TCP Timeout: 20 seconds
> Prolog Alarm Time: 300 seconds
> Alarm Time: 0 of 10 seconds
> Trusted Client List: 192.168.1.254,22.214.171.124,192.168.1.51,127.0.0.1
> JobList: NONE
> diagnostics complete
> Nothing strange at all in the server logs.
From that Trusted Client List, I'm making the following assumptions:
  - xylose's IP is 192.168.1.51
  - fructose has two interfaces: 192.168.1.254 and 126.96.36.199
  - xylose doesn't have access to your "live" 205.142 network, and you intend
    for all cluster traffic to be on the 192.168 network
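
The same reading of the momctl output can be sketched in Python: pull out the
"Trusted Client List:" line and sort its addresses into loopback, private, and
everything else. The function name is made up for illustration; only the
"Trusted Client List:" label and the addresses come from the output quoted
above.

```python
import ipaddress

def classify_trusted_clients(momctl_output):
    """Group the addresses on the 'Trusted Client List:' line by kind."""
    groups = {"loopback": [], "private": [], "other": []}
    for line in momctl_output.splitlines():
        # Tolerate the "> " quoting used in mail replies.
        text = line.lstrip("> ").strip()
        if not text.startswith("Trusted Client List:"):
            continue
        for item in text.split(":", 1)[1].split(","):
            item = item.strip()
            if not item:
                continue
            addr = ipaddress.ip_address(item)
            # Check loopback first: 127.0.0.1 also counts as private.
            if addr.is_loopback:
                groups["loopback"].append(item)
            elif addr.is_private:
                groups["private"].append(item)
            else:
                groups["other"].append(item)
    return groups

# Line copied from the momctl output quoted in this thread.
sample = "Trusted Client List: 192.168.1.254,22.214.171.124,192.168.1.51,127.0.0.1"
print(classify_trusted_clients(sample))
```

Any address that lands in "other" is a candidate for an interface the compute
node cannot reach, which is the case being guessed at here.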
Verify that the first $clienthost in your mom config resolves to 192.168.1.254
with matching forward and reverse.
Also verify that the name listed for xylose in $PBSHOME/server_priv/nodes
resolves to 192.168.1.51 with matching forward and reverse.
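
A minimal sketch of that forward/reverse check, assuming Python is available
on both hosts. The function name and the injectable `forward`/`reverse`
parameters are illustrative, not part of TORQUE. The socket calls go through
the system resolver, so /etc/hosts entries are included.

```python
import socket

def forward_reverse_ok(name, expected_ip,
                       forward=socket.gethostbyname,
                       reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a name -> IP -> name round trip."""
    try:
        ip = forward(name)
    except OSError as exc:
        return False, f"forward lookup of {name} failed: {exc}"
    if ip != expected_ip:
        return False, f"{name} resolves to {ip}, expected {expected_ip}"
    try:
        primary, aliases, _addrs = reverse(ip)
    except OSError as exc:
        return False, f"reverse lookup of {ip} failed: {exc}"
    # Strict match: the reverse lookup must give back the same name
    # (case-insensitive), either as the primary name or as an alias.
    names = {primary.lower(), *(alias.lower() for alias in aliases)}
    if name.lower() in names:
        return True, f"{name} <-> {ip}"
    return False, f"{ip} reverses to {primary}, not {name}"
```

On the compute node, call it with the first $clienthost name and 192.168.1.254.
On the server, call it with the name from the nodes file and 192.168.1.51. A
False result comes with a short explanation of which direction failed.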
Either MOM is sending HELLOs to the wrong IP, the HELLOs are being blocked by
port filtering or a firewall, or the server is sending INIT messages back to
the wrong place. My suspicion is the first possibility.
Crank the loglevels on mom and server all the way up to 9. MOM will log where
it is sending HELLOs, and the server will log who it got HELLOs from.
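
Once the logs are verbose, something like the following sketch could pick out
the relevant lines. The exact wording of the log messages isn't shown in this
thread, so matching on the word "hello" (case-insensitive) is an assumption,
and the function name is made up for illustration.

```python
import re

HELLO_RE = re.compile(r"hello", re.IGNORECASE)
# Loose dotted-quad match; does not validate octet ranges.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def hello_lines(lines):
    """Yield (line, [ip, ...]) for each log line that mentions a hello."""
    for line in lines:
        if HELLO_RE.search(line):
            yield line.rstrip("\n"), IP_RE.findall(line)
```

Pass it an open log file from the mom or the server (given the HomeDirectory
above, probably somewhere under /var/spool/torque) and compare the addresses
it reports on each side against 192.168.1.254 and 192.168.1.51.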
Garrick Staples, Linux/HPCC Administrator
University of Southern California