[torqueusers] Re: More pbs_mom communication problems

Hannu Väisänen hvaisane at joyx.joensuu.fi
Tue Mar 1 03:28:04 MST 2005

On Tue, Mar 01, 2005 at 07:52:41AM +1100, Chris Samuel wrote:
> On Mon, 28 Feb 2005 10:00 pm, Hannu Väisänen wrote:
> > When I do
> > telnet server 15001
> > on the node I get No route to host.
> So something somewhere is blocking the packets.

It was a stupid firewall error. Probably. I have

-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited

but that was not the last rule in /etc/sysconfig/iptables
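
Since iptables evaluates a chain top to bottom, any rule placed after the
catch-all REJECT is never reached. A minimal sketch for spotting that
ordering problem follows. It assumes the saved rules use one "-A" line per
rule (iptables-save style), and rules_after_reject is a hypothetical helper
name, not part of iptables or Torque.

```shell
#!/bin/sh
# Hypothetical helper: print every RH-Firewall-1-INPUT rule that appears
# after the catch-all REJECT rule in a saved iptables rules file.
# Exit status is 1 if such rules exist, 0 otherwise.
rules_after_reject() {
    # $1: path to the saved rules file
    awk '
        # Once the REJECT rule has been seen, later rules in the same
        # chain can never match, so report them with their line number.
        seen && /^-A RH-Firewall-1-INPUT / { print NR ": " $0; found = 1 }
        # Remember that the catch-all REJECT rule has been passed.
        /^-A RH-Firewall-1-INPUT -j REJECT/ { seen = 1 }
        END { exit (found ? 1 : 0) }
    ' "$1"
}
```

Calling rules_after_reject with the path to the saved rules file prints
nothing when the REJECT rule is the last rule of the chain.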

Now, when I run this on the node

telnet server 15001

it says

Trying nnn.nnn.nnn.nnn...
Connected to server.
Escape character is '^]'.

And then it hangs. When I interrupt it with Control-D, it says

+2+15+15056+0+72+26 MSG=cannot decode messageConnection closed by foreign host.

In the server log there is

PBS_Server;Req;req_reject;Reject reply code=15056( MSG=cannot decode message), aux=0, type=0, from @

So I think now there is a route to server.

Everything seemed OK for a while. pbsnodes -a listed both nodes as
free. (I have two nodes, one on the same machine as the server and
the other on another machine.)

Then this appeared in the other machine's mom log

pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr

It appears every 8 or 9 minutes.

Now pbsnodes -a again lists the node as down.

> How many interfaces do you have ?

Only one.

> Doesn't sound to me like a Torque/PBS issue at all.

It is probably a firewall/network issue.

So, where can I find Idiot's Guide to Firewalls and Networks? (-:

> Good luck!

Thanks, Chris, and everybody else helping me!

