[torqueusers] Nodes down for a long time after server restart.

Roy Dragseth Roy.Dragseth at cc.uit.no
Wed Oct 26 14:01:40 MDT 2005


Hi.

OS: CentOS 4.2
Torque: 2.0.0p0
Setup: one frontend, one compute node.
After a server restart the node never comes back up. pbs_server reports a
protocol failure when it pings the node, and pbs_mom logs nothing except an
end-of-file from the server.

pbs server log:

10/26/2005 21:25:56;0002;PBS_Server;Svr;Log;Log opened
10/26/2005 21:25:56;0006;PBS_Server;Svr;PBS_Server;Server toorah.student.uit.no started, initialization type = 1
10/26/2005 21:25:56;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20051026 opened
10/26/2005 21:25:56;0040;PBS_Server;Req;setup_nodes;setup_nodes()

10/26/2005 21:25:56;0086;PBS_Server;Svr;PBS_Server;Recovered queue default
10/26/2005 21:25:56;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
10/26/2005 21:25:56;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
10/26/2005 21:25:56;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002
10/26/2005 21:25:56;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 31539, loglevel=0
10/26/2005 21:25:56;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes

10/26/2005 21:25:56;0001;PBS_Server;Svr;PBS_Server;ping_nodes, Protocol failure in commit 9 to compute-0-1.local(10.255.255.253:15003)

pbs mom log:

10/26/2005 21:25:44;0002;   pbs_mom;node;im_eof;End of File from addr 10.1.1.1:15001
10/26/2005 21:25:44;0002;   pbs_mom;n/a;mom_main;hello sent to server
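
To line up the two logs, here is a minimal sketch in Python. It assumes every
entry is six semicolon-separated fields (timestamp, event code, daemon, object
class, object name, message), which is what the excerpts above look like. The
field names and the parse_line helper are my own labels, not anything Torque
provides, and the comparison is only meaningful if the clocks on the two hosts
agree.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels for the six columns seen in the logs.
    when: datetime
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_line(line: str) -> LogRecord:
    """Parse one log entry, assuming six ';'-separated fields."""
    # Split on the first five semicolons only, so a message containing ';' stays whole.
    stamp, event, daemon, obj_class, obj_name, message = line.split(";", 5)
    when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    return LogRecord(when, event, daemon.strip(), obj_class, obj_name, message.strip())


# The two entries of interest, copied from the logs above.
server = parse_line(
    "10/26/2005 21:25:56;0001;PBS_Server;Svr;PBS_Server;ping_nodes, Protocol "
    "failure in commit 9 to compute-0-1.local(10.255.255.253:15003)"
)
mom = parse_line(
    "10/26/2005 21:25:44;0002;   pbs_mom;n/a;mom_main;hello sent to server"
)

# Seconds between the last pbs_mom entry and the server-side failure.
gap = (server.when - mom.when).total_seconds()
print(f"{mom.daemon} last logged {gap:.0f}s before {server.daemon}: {server.message}")
```

With the entries above, the last pbs_mom line is stamped 12 seconds before the
server reports the failed ping.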


Any hints?

r.
-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
	 Direct call: +47 77 64 62 56. email: royd at cc.uit.no

