[torqueusers] Re: More pbs_mom communication problems
hvaisane at joyx.joensuu.fi
Tue Mar 1 03:28:04 MST 2005
On Tue, Mar 01, 2005 at 07:52:41AM +1100, Chris Samuel wrote:
> On Mon, 28 Feb 2005 10:00 pm, Hannu Väisänen wrote:
> > When I do
> > telnet server 15001
> > on the node I get No route to host.
> So something somewhere is blocking the packets.
It was a stupid firewall error. Probably. I have
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
but that was not the last rule in /etc/sysconfig/iptables
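A minimal sketch of how one might spot this kind of ordering mistake, assuming the rules file uses iptables-save style lines ("-A CHAIN ... -j TARGET"). The function name and the sample port-15001 ACCEPT rule are hypothetical, not taken from my actual configuration. It lists every rule that comes after an unconditional REJECT or DROP in the same chain, since such rules can never match.

```python
# Sketch only: find rules placed after a catch-all REJECT/DROP in a chain.
# Assumes iptables-save style lines; all names here are illustrative.

def rules_after_reject(lines):
    """Return (chain, rule) pairs that follow an unconditional REJECT/DROP."""
    closed = set()    # chains in which a catch-all REJECT/DROP was already seen
    shadowed = []     # rules that can never be reached
    for raw in lines:
        parts = raw.split()
        if len(parts) < 4 or parts[0] != "-A":
            continue  # skip comments, table headers, chain policies, COMMIT
        chain = parts[1]
        if chain in closed:
            shadowed.append((chain, raw.strip()))
        # No match options before -j means the rule applies to every packet.
        elif parts[2] == "-j" and parts[3] in ("REJECT", "DROP"):
            closed.add(chain)
    return shadowed


# Hypothetical sample: the ACCEPT for port 15001 sits after the REJECT.
SAMPLE_RULES = [
    "-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    "-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited",
    "-A RH-Firewall-1-INPUT -p tcp --dport 15001 -j ACCEPT",
]
```

Calling rules_after_reject(SAMPLE_RULES) should flag the port-15001 rule. Moving that rule above the REJECT line is the fix.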
Now, when I do this on the node
telnet server 15001
Connected to server.
Escape character is '^]'.
And then it hangs. When I interrupt it with Control-D, it says
+2+15+15056+0+72+26 MSG=cannot decode messageConnection closed by foreign host.
In the server log there is
PBS_Server;Req;req_reject;Reject reply code=15056( MSG=cannot decode message), aux=0, type=0, from @
So I think now there is a route to server.
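For repeating the telnet check from a script, here is a rough sketch using only Python's standard socket module. The function name and the 5 second timeout are my own assumptions. It only shows whether a TCP connection to the port can be opened, and it says nothing about whether pbs_mom and pbs_server can actually talk to each other.

```python
import socket

# Sketch only: a scripted stand-in for "telnet server 15001".
def can_connect(host, port, timeout=5.0):
    """Return (True, "") if a TCP connection opens, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts, name lookup failures and
        # unreachable hosts (the "No route to host" case).
        return False, str(exc)
```

On the node, can_connect("server", 15001) should return (True, "") once the firewall lets the packets through.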
Everything seemed OK for a while. pbsnodes -a listed both nodes as
free. (I have two nodes, one on the same machine as the server and
the other on another machine.)
Then this appeared in the other machine's mom log
pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 126.96.36.199:15001
It comes every 8 or 9 minutes.
Now pbsnodes -a again lists the node as down.
> How many interfaces do you have ?
> Doesn't sound to me like a Torque/PBS issue at all.
It is probably a firewall/networks issue.
So, where can I find Idiot's Guide to Firewalls and Networks? (-:
> Good luck!
Thanks, Chris, and everybody else helping me!