[torqueusers] Server not talking to MOMs at all

Prakash Velayutham velayups at email.uc.edu
Mon Aug 15 13:33:17 MDT 2005

Prakash Velayutham wrote:

> Hi All,
> I have a system with just one compute node and one server. The server 
> daemon on the server system starts up fine, and the MOM starts up fine 
> on the compute node, but there is no communication between the two. The 
> strange thing is that the two were talking for almost half a day 
> sometime last Wednesday. After I restarted the MOM and the server (to 
> add more nodes), all the nodes started showing up as state-unknown,down 
> in "pbsnodes". Even after I removed the newly added nodes, things have 
> not gone back to normal.
> Here is the output of momctl -d 4 -h yy.yy.yy.yy (on the compute node):
> Host: xylose/xylose.dmzcluster.cchmc.org   Server: fructose   Version: torque_1.2.0p5
> HomeDirectory:          /var/spool/torque/mom_priv
> MOM active:             15756 seconds
> WARNING:  no messages received from server
> Last Msg To Server:     0 seconds
> Server Update Interval: 20 seconds
> WARNING:  no hello/cluster-addrs messages received from server
> Init Msgs Sent:         1581 hellos
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    RPP
> TCP Timeout:            20 seconds
> Prolog Alarm Time:      300 seconds
> Alarm Time:             0 of 10 seconds
> Trusted Client List:    
> JobList:                NONE
> diagnostics complete
> When the server daemon starts, it also does not seem to detect that the 
> node is available. It just does not see it.
> As a side note, I noticed someone on the list saying that netfilter 
> iptables might cause this. I have masquerading set up on the server. 
> Would that affect this?
> Any help greatly appreciated.
> Thanks,
> Prakash
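
For anyone comparing their own output against the momctl report quoted above, the telling fields are the two WARNING lines and the "Init Msgs Sent" counter: the MOM has sent 1581 hellos and heard nothing back. The following is a minimal Python sketch, not part of TORQUE, for pulling those fields out of saved momctl output. The function name parse_momctl and the simple "Key: value" parsing are assumptions made for illustration only.

```python
def parse_momctl(text):
    """Split momctl diagnostic text into WARNING messages and 'Key: value' fields.

    Hypothetical helper for illustration; the real momctl output format may
    differ between TORQUE versions.
    """
    warnings = []
    fields = {}
    for raw in text.splitlines():
        # Drop mail quote markers ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if not line:
            continue
        if line.startswith("WARNING:"):
            warnings.append(line[len("WARNING:"):].strip())
        elif ":" in line:
            # Split on the first colon only; the rest is kept as the value.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return warnings, fields


# A few lines copied from the report above.
sample = """\
> MOM active:             15756 seconds
> WARNING:  no messages received from server
> Last Msg To Server:     0 seconds
> WARNING:  no hello/cluster-addrs messages received from server
> Init Msgs Sent:         1581 hellos
"""

warnings, fields = parse_momctl(sample)
for message in warnings:
    print("warning:", message)
print("hellos sent:", fields["Init Msgs Sent"].split()[0])
```

A non-empty warnings list together with a large hello count suggests the MOM is sending but never getting an answer, which matches the symptom described.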

Sorry, forgot to mention that the MOM complains once in a while as follows:
pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr x.x.x.x:15001
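
One way to narrow down whether a firewall sits between the two hosts is to test whether the TCP ports can be reached at all. Below is a rough Python sketch under stated assumptions: the helper name tcp_port_open is made up for this example, port 15001 is taken from the log line above, and 15002 is assumed to be the MOM port (check your own configuration). Note that momctl reports the communication model as RPP, which is understood to run over UDP, so a successful TCP connect does not prove that RPP traffic gets through. A failed connect does at least point towards filtering or routing.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration. It tests TCP only and says
    nothing about UDP traffic.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (host names are the ones from the report; ports are assumed):
#   on the compute node:  tcp_port_open("fructose", 15001)
#   on the server:        tcp_port_open("xylose", 15002)
```

Running the check in both directions matters here, since masquerading rules can let traffic through one way and not the other.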

