[torqueusers] Disappearing Nodes

Jeremy Stout stout.jeremy at gmail.com
Wed Mar 23 08:04:14 MST 2005


Hello. Over the weekend, I noticed that the nodes on my cluster would
disappear and come back every few minutes. When they disappeared,
their status would often show as "down". I've looked at the server and
mom logs and have not been able to figure out what the problem is. I
wiped out my existing torque installation and reinstalled everything
this morning (as per the Torque Quick Installation Guide).
Unfortunately, I am still getting the same error messages. Any help
would be appreciated.

Here is a brief summary of the error messages I'm seeing:
pbs_server:
03/23/2005 09:41:01;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown the server, type is By Signal
03/23/2005 09:41:06;0002;PBS_Server;Svr;Log;Log opened
03/23/2005 09:41:06;0006;PBS_Server;Svr;PBS_Server;Server energy started, initialization type = 1
03/23/2005 09:41:06;0002;PBS_Server;Svr;Act;Account file /usr/spool/PBS/server_priv/accounting/20050323 opened
03/23/2005 09:41:06;0040;PBS_Server;Req;setup_nodes;setup_nodes()
03/23/2005 09:41:06;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
03/23/2005 09:41:06;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002
03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 13231
03/23/2005 09:41:06;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
03/23/2005 09:51:01;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown the server, type is By Signal
03/23/2005 09:51:05;0002;PBS_Server;Svr;Log;Log opened
03/23/2005 09:51:05;0006;PBS_Server;Svr;PBS_Server;Server energy started, initialization type = 1
03/23/2005 09:51:05;0002;PBS_Server;Svr;Act;Account file /usr/spool/PBS/server_priv/accounting/20050323 opened
03/23/2005 09:51:05;0040;PBS_Server;Req;setup_nodes;setup_nodes()
03/23/2005 09:51:05;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
03/23/2005 09:51:05;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002
03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 13439
03/23/2005 09:51:05;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004

pbs_mom:
03/23/2005 09:26:01;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
03/23/2005 09:26:01;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
03/23/2005 09:31:00;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.10.10.1:15001
03/23/2005 09:41:01;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.10.10.1:15001
03/23/2005 09:49:33;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.10.10.1:15001
03/23/2005 09:51:01;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.10.10.1:15001
03/23/2005 10:01:01;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.10.10.1:15001
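
To check how regularly the server shutdowns recur, a rough sketch along
these lines could be run over the logs. It assumes only the
"MM/DD/YYYY HH:MM:SS;code;..." layout visible in the excerpts above and
tolerates entries that the mail client wrapped onto a second line. The
helper names parse_entries and gaps_in_seconds are made up for
illustration and are not part of Torque.

```python
from datetime import datetime

def parse_entries(text):
    """Return (timestamp, rest_of_entry) pairs, rejoining wrapped lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        head, _, rest = line.partition(";")
        try:
            stamp = datetime.strptime(head, "%m/%d/%Y %H:%M:%S")
        except ValueError:
            # No leading timestamp: treat as a continuation of the previous entry.
            if entries:
                prev_stamp, prev_rest = entries[-1]
                entries[-1] = (prev_stamp, prev_rest + " " + line)
            continue
        entries.append((stamp, rest))
    return entries

def gaps_in_seconds(entries, needle):
    """Seconds between consecutive entries whose text contains `needle`."""
    stamps = [stamp for stamp, rest in entries if needle in rest]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# Two shutdown entries copied from the pbs_server excerpt above.
sample = """\
03/23/2005 09:41:01;0086;PBS_Server;Svr;PBS_Server;Starting to
shutdown the server, type is By Signal
03/23/2005 09:41:06;0002;PBS_Server;Svr;Log;Log opened
03/23/2005 09:51:01;0086;PBS_Server;Svr;PBS_Server;Starting to
shutdown the server, type is By Signal
"""

print(gaps_in_seconds(parse_entries(sample), "shutdown the server"))  # prints [600.0]
```

Passing "End of File" as the needle should give the same kind of list
for the pbs_mom log.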

Thank you.

Jeremy Stout

