[torqueusers] Nodes down for a long time after server restart.
Roy Dragseth
Roy.Dragseth at cc.uit.no
Wed Oct 26 14:01:40 MDT 2005
Hi.
OS: CentOS 4.2
Torque: 2.0.0p0
Setup: one frontend, one compute node.
After a server restart the node never comes back up. pbs_server reports a
protocol failure, and pbs_mom logs nothing except an End of File from the
server.
pbs server log:
10/26/2005 21:25:56;0002;PBS_Server;Svr;Log;Log opened
10/26/2005 21:25:56;0006;PBS_Server;Svr;PBS_Server;Server toorah.student.uit.no started, initialization type = 1
10/26/2005 21:25:56;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20051026 opened
10/26/2005 21:25:56;0040;PBS_Server;Req;setup_nodes;setup_nodes()
10/26/2005 21:25:56;0086;PBS_Server;Svr;PBS_Server;Recovered queue default
10/26/2005 21:25:56;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
10/26/2005 21:25:56;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
10/26/2005 21:25:56;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002
10/26/2005 21:25:56;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 31539, loglevel=0
10/26/2005 21:25:56;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
10/26/2005 21:25:56;0001;PBS_Server;Svr;PBS_Server;ping_nodes, Protocol failure in commit 9 to compute-0-1.local(10.255.255.253:15003)
pbs mom log:
10/26/2005 21:25:44;0002; pbs_mom;node;im_eof;End of File from addr 10.1.1.1:15001
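
For comparing what each side reports, here is a minimal Python sketch that
splits two of the log records above on their semicolon-separated fields and
pulls the address:port pairs out of the message text. The field names
(stamp, code, daemon, and so on) and the helper names parse_record and
endpoints are my own labels for illustration, not anything TORQUE defines,
and the sample lines are copied from the logs with the wrapped lines rejoined.

```python
import re

# Two records copied from the logs above (wrapped lines rejoined).
SERVER_LINE = (
    "10/26/2005 21:25:56;0001;PBS_Server;Svr;PBS_Server;ping_nodes, Protocol "
    "failure in commit 9 to compute-0-1.local(10.255.255.253:15003)"
)
MOM_LINE = (
    "10/26/2005 21:25:44;0002; pbs_mom;node;im_eof;"
    "End of File from addr 10.1.1.1:15001"
)

# Dotted-quad IPv4 address followed by a port number.
ADDR = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def parse_record(line):
    """Split one log record into six fields (field names are assumed labels)."""
    # Split on the first five semicolons only, so the message may contain ';'.
    stamp, code, daemon, obj_type, obj_name, message = line.split(";", 5)
    return {
        "stamp": stamp,
        "code": code,
        "daemon": daemon.strip(),
        "type": obj_type,
        "name": obj_name,
        "message": message,
    }


def endpoints(line):
    """Return every (ip, port) pair mentioned in the record's message."""
    message = parse_record(line)["message"]
    return [(ip, int(port)) for ip, port in ADDR.findall(message)]


print("server pings node at:", endpoints(SERVER_LINE))
print("mom sees server at:  ", endpoints(MOM_LINE))
```

On these two lines it shows the server contacting the node at
10.255.255.253:15003, while the mom reports the server's address as
10.1.1.1:15001. Whether that difference is related to the protocol failure
is not something the logs alone establish.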
10/26/2005 21:25:44;0002; pbs_mom;n/a;mom_main;hello sent to server
any hints?
r.
--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd at cc.uit.no