[torqueusers] Server not talking to MOMs at all
Prakash Velayutham
velayups at email.uc.edu
Mon Aug 15 13:33:17 MDT 2005
Prakash Velayutham wrote:
> Hi All,
>
> I have just a 1 node + 1 server system. The server on the server
> system starts up just fine, and the MOM starts up on the compute node
> just fine, but there is no communication between the two. The strange
> thing is that the two were talking for almost half a day
> sometime last Wednesday. When I restarted the MOM and server (to
> add more nodes), all the nodes started showing up as state-unknown,down
> in "pbsnodes". Even after I removed the newly added nodes, things have
> not gone back to normal.
>
> Here is the output of momctl -d 4 -h yy.yy.yy.yy (on the compute node):
>
> Host: xylose/xylose.dmzcluster.cchmc.org Server: fructose Version: torque_1.2.0p5
> HomeDirectory: /var/spool/torque/mom_priv
> MOM active: 15756 seconds
> WARNING: no messages received from server
> Last Msg To Server: 0 seconds
> Server Update Interval: 20 seconds
> WARNING: no hello/cluster-addrs messages received from server
> Init Msgs Sent: 1581 hellos
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: RPP
> TCP Timeout: 20 seconds
> Prolog Alarm Time: 300 seconds
> Alarm Time: 0 of 10 seconds
> Trusted Client List: 192.168.1.254,205.142.199.176,192.168.1.51,127.0.0.1
> JobList: NONE
>
> diagnostics complete
>
> When the server daemon starts, I also don't see it detect that the
> node is available. It just does not see it.
> As a side note, I noticed someone on the list saying that netfilter
> iptables might cause this. I have masquerading enabled on the server.
> Would that affect this?
>
> Any help greatly appreciated.
>
> Thanks,
> Prakash
Sorry, forgot to mention that the MOM complains once in a while as follows:
pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr x.x.x.x:15001
Prakash
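
A minimal sketch, not part of TORQUE, for pulling the WARNING lines and the Trusted Client List out of momctl output like the one quoted above. The idea is to check whether the address the server's packets actually arrive from (which masquerading may rewrite) is one the MOM trusts. The function names, the shortened sample text, and the placeholder address 192.0.2.10 are assumptions for illustration; only the field labels come from the quoted output.

```python
import re


def parse_momctl(text):
    """Return (warnings, trusted_addresses) from pasted momctl diagnostics."""
    # Drop mail quoting ("> ") so output pasted from a reply also parses.
    lines = [re.sub(r"^\s*>\s?", "", ln).strip() for ln in text.splitlines()]

    # Every line starting with "WARNING:" is kept, minus the prefix.
    prefix = "WARNING:"
    warnings = [ln[len(prefix):].strip() for ln in lines if ln.startswith(prefix)]

    trusted = []
    label = "Trusted Client List:"
    for i, ln in enumerate(lines):
        if ln.startswith(label):
            rest = ln[len(label):].strip()
            # The address list may have wrapped onto the following line.
            if not rest and i + 1 < len(lines):
                rest = lines[i + 1]
            trusted = [a.strip() for a in rest.split(",") if a.strip()]
            break
    return warnings, trusted


def server_is_trusted(trusted, server_ip):
    """True if server_ip appears verbatim in the MOM's trusted client list."""
    return server_ip in trusted


# Shortened sample based on the diagnostics quoted in this message.
SAMPLE = """\
> MOM active: 15756 seconds
> WARNING: no messages received from server
> WARNING: no hello/cluster-addrs messages received from server
> Trusted Client List:
> 192.168.1.254,205.142.199.176,192.168.1.51,127.0.0.1
> JobList: NONE
"""

if __name__ == "__main__":
    warnings, trusted = parse_momctl(SAMPLE)
    for w in warnings:
        print("warning:", w)
    # 192.0.2.10 is a placeholder; substitute the source address the MOM
    # host really sees for the server's traffic.
    print(server_is_trusted(trusted, "192.0.2.10"))
```

If the address seen on the compute node is not in the list, that would be consistent with the MOM ignoring the server, though it does not by itself prove that masquerading is the cause.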