[torqueusers] Nodes too long listed as down

Julian Hagenauer chaosbringer at gmx.de
Thu Nov 2 06:14:00 MST 2006


On Thu, 2 Nov 2006 10:45:01 +0100
Julian Hagenauer <chaosbringer at gmx.de> wrote:

> 
> > Hi,
> > I found out more:
> > If the mom-logs on the worker-node show this:
> > 11/02/2006 10:11:33;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 192.168.1.4:15001
> > 11/02/2006 10:11:40;0002;   pbs_mom;n/a;mom_main;hello sent to server head.chaosbringer.de
> > 11/02/2006 10:11:41;0002;   pbs_mom;Svr;im_eof;End of File from addr 192.168.1.4:15001
> > then from this moment on the node is listed as free on 192.168.1.4 (the host with the server and scheduler), and communication seems to work.
> > What does this message mean?
> > I hope I am on the right track to solve this problem.
> > 
> > Thanks,
> > Julian
> >
> 
> And again... With loglevel set to 7, the mom-logs show this:
> 
> 11/02/2006 10:31:52;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 192.168.1.4:15001
> 11/02/2006 10:31:53;0002;   pbs_mom;n/a;init_server_stream;init_server_stream: trying to open RPP conn to head.chaosbringer.de port 15001
> 11/02/2006 10:31:53;0002;   pbs_mom;n/a;init_server_stream;init_server_stream: added connection to head.chaosbringer.de port 15001
> 11/02/2006 10:31:59;0002;   pbs_mom;n/a;mom_main;hello sent to server head.chaosbringer.de
> 
> head is the host 192.168.1.4, on which the torque server and scheduler run.
> 
> Thanks,
> Julian

I looked it over again.
pbs_mom sends status updates to the server for roughly 3 minutes, but the server itself does not respond. Once these 3 minutes have passed, pbs_mom seems to open a new RPP connection (see above), and then everything works nicely.
What are those three minutes? A timeout? Why does pbs_mom not recognize that there are no replies from the server?
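
To compare the timestamps across several restarts, the mom log can be scanned with a small script. This is only a minimal sketch: it assumes the semicolon-separated log format shown above, and the names parse_mom_line and reconnect_delays are made up for illustration and are not part of torque. It measures how long after an im_eof entry the next "hello sent" entry appears.

```python
from datetime import datetime


def parse_mom_line(line):
    """Split one pbs_mom log line into (timestamp, function, message).

    Assumed layout: date time;code;daemon;class;function;message
    Returns None for lines that do not match this layout.
    """
    fields = line.strip().split(";", 5)
    if len(fields) != 6:
        return None
    try:
        stamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, fields[4], fields[5]


def reconnect_delays(lines):
    """Seconds between the first im_eof entry and the next 'hello sent'."""
    delays = []
    eof_time = None
    for line in lines:
        parsed = parse_mom_line(line)
        if parsed is None:
            continue
        stamp, func, message = parsed
        if func == "im_eof" and eof_time is None:
            # Remember when the broken connection was first noticed.
            eof_time = stamp
        elif (func == "mom_main" and message.startswith("hello sent")
              and eof_time is not None):
            delays.append((stamp - eof_time).total_seconds())
            eof_time = None
    return delays


# The loglevel 7 excerpt quoted above, without the "> " quoting prefix.
SAMPLE = [
    "11/02/2006 10:31:52;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 192.168.1.4:15001",
    "11/02/2006 10:31:53;0002;   pbs_mom;n/a;init_server_stream;init_server_stream: trying to open RPP conn to head.chaosbringer.de port 15001",
    "11/02/2006 10:31:53;0002;   pbs_mom;n/a;init_server_stream;init_server_stream: added connection to head.chaosbringer.de port 15001",
    "11/02/2006 10:31:59;0002;   pbs_mom;n/a;mom_main;hello sent to server head.chaosbringer.de",
]

if __name__ == "__main__":
    print(reconnect_delays(SAMPLE))  # prints [7.0]
```

For the excerpt above this gives 7 seconds between the EOF and the hello, so the reconnect itself is quick; the roughly 3 minutes pass before the im_eof entry is logged.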

Thank you,
Julian

