[torqueusers] Nodes to long listed as down

Julian Hagenauer chaosbringer at gmx.de
Thu Nov 2 02:15:48 MST 2006


On Thu, 2 Nov 2006 08:30:31 +0100
Julian Hagenauer <chaosbringer at gmx.de> wrote:

> On Thu, 2 Nov 2006 08:22:08 +0100
> Julian Hagenauer <chaosbringer at gmx.de> wrote:
> 
> > On Wed, 1 Nov 2006 13:58:05 -0700
> > Garrick Staples <garrick at clusterresources.com> wrote:
> > 
> > > On Tue, Oct 31, 2006 at 12:41:54PM +0100, Julian Hagenauer alleged:
> > > > Hi,
> > > > I have a very strange setup :-)
> > > > I have two identical servers, both running a torque-server and a
> > > > torque-scheduler, and only one node running the mom.
> > > > Only one server is accessible at a time, but it gets swapped
> > > > periodically for the other server.
> > > > You can think of it like this:
> > > > 
> > > > Server1----|
> > > > 	   |-----------Node
> > > > 
> > > > Server2----
> > > > 
> > > > The servers get switched dynamically while both are running.
> > > > If Server1 is booted (and accessible), it takes about 15 seconds
> > > > until the node gets marked as free.
> > > > If I dynamically switch to Server2 after some time, it takes about
> > > > 3:15 minutes until the node gets marked as free.
> > > > That is far too long for my case; I want the node to be recognized
> > > > as free as soon as possible...
> > > > I have looked through the configuration options, but did not find
> > > > anything suitable.
> > > > I have set the server's node_ping_rate to 5 and tested several
> > > > node_check_rate values without any change in behaviour.
> > > > On the node side I have set $status_update_time to 5 seconds, but
> > > > the node is still not recognized as free any earlier.
> > > > 
> > > > What am I missing?
> > > 
> > > Arp cache on the node?
> > > 
> > > We don't really support such configurations right now, though some HA
> > > plans are on the table.
> > 
> > Hi,
> > yes, Server1, Server2 and the node are virtual machines, and the virtual
> > machine monitor has an arp cache enabled, so that packets get routed
> > correctly.
> > What are HA plans? Is there a way around that, e.g. manipulating the
> > arp-table or something?
> > 
> > Thank you,
> > Julian
> 
> Sorry, I meant arp-proxy, not arp-cache... but maybe these terms mean the same anyway...
> Sincerely,
> Julian

Hi,
I found out more:
Once the mom log on the worker node shows this:
11/02/2006 10:11:33;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 192.168.1.4:15001
11/02/2006 10:11:40;0002;   pbs_mom;n/a;mom_main;hello sent to server head.chaosbringer.de
11/02/2006 10:11:41;0002;   pbs_mom;Svr;im_eof;End of File from addr 192.168.1.4:15001
then from that moment on the node is listed as free on 192.168.1.4 (the host running the server and scheduler), and communication seems to work.
What do these messages mean?
I hope I am on the right track to solving this problem.
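
To make the delay measurable, here is a minimal Python sketch that pulls the "hello sent to server" entries out of a mom log, so their timestamps can be compared with the moment of the server switch. The field layout (timestamp;code;daemon;class;name;message) and the MM/DD/YYYY date format are assumptions inferred from the three log lines above, not taken from the TORQUE documentation, and the names MomLogEntry, parse_mom_line and hello_times are made up for illustration.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class MomLogEntry(NamedTuple):
    """One mom log line, split into the six fields assumed above."""
    when: datetime
    code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_mom_line(line: str) -> Optional[MomLogEntry]:
    """Parse one log line; return None if it does not match the assumed layout."""
    # Split into at most six fields so semicolons inside the message survive.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        # Assumed date format: MM/DD/YYYY HH:MM:SS
        when = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    code, daemon, obj_class, obj_name, message = (p.strip() for p in parts[1:])
    return MomLogEntry(when, code, daemon, obj_class, obj_name, message)


def hello_times(lines: Iterable[str]) -> List[datetime]:
    """Return the timestamps of all 'hello sent to server' entries."""
    times = []
    for line in lines:
        entry = parse_mom_line(line)
        if entry is None:
            continue
        if entry.obj_name == "mom_main" and entry.message.startswith("hello sent to server"):
            times.append(entry.when)
    return times


# The three lines from the log excerpt above.
SAMPLE = [
    "11/02/2006 10:11:33;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 192.168.1.4:15001",
    "11/02/2006 10:11:40;0002;   pbs_mom;n/a;mom_main;hello sent to server head.chaosbringer.de",
    "11/02/2006 10:11:41;0002;   pbs_mom;Svr;im_eof;End of File from addr 192.168.1.4:15001",
]

if __name__ == "__main__":
    for t in hello_times(SAMPLE):
        print("hello sent at", t.isoformat(sep=" "))
```

Subtracting the time of the server switch from the first timestamp returned after it gives the delay before the mom re-announces itself, which would show whether the 3:15 minutes is spent waiting for that hello.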

Thanks,
Julian

PS:
I checked the arp caches. They do not seem to contain any incorrect entries.

