[torqueusers] Re: Disappearing Nodes

gianfranco sciacca gs at hep.ucl.ac.uk
Thu Mar 31 11:10:32 MST 2005


OK, got there finally: I was missing port 15001/udp from the lot.
I had all the rest OK. This is the set that does it for me on both
machines of the test setup (any other port, with a few dedicated
exceptions, is closed on my system):

-A RH-Firewall-1-INPUT -p udp -m udp --dport 15001 -m state --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 15001 -m state --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 15004 -m state --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 15003 -m state --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -p udp -m udp --dport 15003 -m state --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 15002 -m state --state NEW -j ACCEPT
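
For anyone who wants to keep the same list on several machines, here is a minimal Python sketch that prints rule lines in the format above from a list of (protocol, port) pairs. The names RULES, format_rule and format_rules are made up for illustration; the chain name and ports are just the ones from my setup, so adjust them for yours.

```python
# Protocol/port pairs copied from the rule set above.
RULES = [
    ("udp", 15001),
    ("tcp", 15001),
    ("tcp", 15004),
    ("tcp", 15003),
    ("udp", 15003),
    ("tcp", 15002),
]


def format_rule(proto, port, chain="RH-Firewall-1-INPUT"):
    """Return one rule line that accepts NEW packets to the given port."""
    return (
        f"-A {chain} -p {proto} -m {proto} --dport {port} "
        "-m state --state NEW -j ACCEPT"
    )


def format_rules(rules=RULES, chain="RH-Firewall-1-INPUT"):
    """Return one rule line per (protocol, port) pair, in the given order."""
    return [format_rule(proto, port, chain) for proto, port in rules]
```

Calling print("\n".join(format_rules())) gives the six lines, which can then be pasted into the firewall configuration by hand. The sketch only formats text and does not touch the running firewall.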

Didn't need 1023 or anything else. But thanks Garrick for pointing me in
the right direction. I was convinced I had it right before your
suggestion.
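
In case it helps someone else debugging the same thing, below is a rough Python sketch for checking from one machine whether the TCP ports on the other accept connections. The function name check_tcp_ports is hypothetical, and the check has limits: a failed connect can mean either that the port is filtered or that nothing is listening on it, and it says nothing about UDP (which was exactly what I was missing), since UDP has no connection handshake to test.

```python
import socket


def check_tcp_ports(host, ports, timeout=2.0):
    """Try a TCP connect to each port; map port -> True if it succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError (refused, timed out,
            # unreachable, name lookup failure) when the connect fails.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, check_tcp_ports("node01", [15001, 15002, 15003, 15004]) run on the server, where "node01" is a placeholder host name, returns a dictionary showing which of the four ports answered.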

cheers, gianfranco
 
On Wed, 2005-03-30 at 17:29, Garrick Staples wrote:
> On Wed, Mar 30, 2005 at 04:53:02PM +0100, gianfranco sciacca alleged:
> > On Thu, 2005-03-24 at 06:50, Hannu Väisänen wrote:
> > > On Wed, Mar 23, 2005 at 10:04:14AM -0500, Jeremy Stout wrote:
> > > > Hello. Over the weekend, I noticed that the nodes on my cluster would
> > > > disappear and come back every few minutes. When they would disappear,
> > > > the status would often appear as "down".
> > > 
> > > This may have something to do with firewalls.
> > > My nodes disappear soon after I disable the firewall, and then
> > > pbs_mom log shows something like this
> > > 
> > > pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol version
> > > pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr xxx.xxx.xxx.xxx:15001
> > >                                                   Server node->===============
> > > 
> > > Port 15001 is enabled in firewall.
> [...awkward top-posting edited..]
> > 
> > In my case ports 15001 to 15004 are open in the firewall on both
> > machines of my test cluster. Indeed, the node is assigned jobs and
> > executes them, provided it is marked free manually. It then returns to
> > the down state.
> > 
> > Searching the archives, I've seen the issue "no hello/cluster-addrs
> > messages received from server" (which I get probing the node with
> > momctl) mentioned a few times, but a possible solution was never
> > mentioned.
> 
> I'm pretty sure all of those cases were caused by net filtering.
> 
>  
> > How do I get round this? I should probably mention that I've followed
> > the quickstart guide by the numbers. In addition I have configured the
> > server and adjusted the firewall as mentioned above. Is there an
> > additional step needed to get started?
> 
> I suspect you also need ports 512 through 1024 open.  But I'd just disable all
> net filtering between your nodes and torque server.  There's a


