[torqueusers] Disappearing Nodes
Troy Baer
troy at osc.edu
Wed Mar 23 11:29:27 MST 2005
On Wed, 2005-03-23 at 10:04, Jeremy Stout wrote:
> Hello. Over the weekend, I noticed that the nodes on my cluster would
> disappear and come back every few minutes. When they would disappear,
> the status would often appear as "down". I've looked at the server/
> mom logs and have not been able to figure out what the problem is. I
> wiped out my existing torque installation and reinstalled everything
> this morning (as per the Torque Quick Installation Guide).
> Unfortunately, I am still getting the same error messages. Any help
> would be appreciated.
>
> Here is a brief summary of the error messages I'm seeing:
> pbs_server:
> 03/23/2005 09:41:01;0086;PBS_Server;Svr;PBS_Server;Starting to
> shutdown the server, type is By Signal
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;Log;Log opened
> 03/23/2005 09:41:06;0006;PBS_Server;Svr;PBS_Server;Server energy
> started, initialization type = 1
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;Act;Account file
> /usr/spool/PBS/server_priv/accounting/20050323 opened
> 03/23/2005 09:41:06;0040;PBS_Server;Req;setup_nodes;setup_nodes()
> 03/23/2005 09:41:06;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Expected 1,
> recovered 1 queues
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 03/23/2005 09:41:06;0006;PBS_Server;Svr;PBS_Server;Using ports
> Server:15001 Scheduler:15004 MOM:15002
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 13231
> 03/23/2005 09:41:06;0001;PBS_Server;Svr;PBS_Server;Connection refused
> (111) in contact_sched, Could not contact Scheduler - port 15004
> 03/23/2005 09:51:01;0086;PBS_Server;Svr;PBS_Server;Starting to
> shutdown the server, type is By Signal
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;Log;Log opened
> 03/23/2005 09:51:05;0006;PBS_Server;Svr;PBS_Server;Server energy
> started, initialization type = 1
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;Act;Account file
> /usr/spool/PBS/server_priv/accounting/20050323 opened
> 03/23/2005 09:51:05;0040;PBS_Server;Req;setup_nodes;setup_nodes()
>
> 03/23/2005 09:51:05;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Expected 1,
> recovered 1 queues
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 03/23/2005 09:51:05;0006;PBS_Server;Svr;PBS_Server;Using ports
> Server:15001 Scheduler:15004 MOM:15002
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 13439
> 03/23/2005 09:51:05;0001;PBS_Server;Svr;PBS_Server;Connection refused
> (111) in contact_sched, Could not contact Scheduler - port 15004
>
> pbs_mom:
> 03/23/2005 09:26:01;0002; pbs_mom;n/a;is_update_stat;hello sent to server
> 03/23/2005 09:26:01;0002; pbs_mom;n/a;is_update_stat;hello sent to server
> 03/23/2005 09:31:00;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001
> 03/23/2005 09:41:01;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001
> 03/23/2005 09:49:33;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end
> of message from addr 10.10.10.1:15001
> 03/23/2005 09:51:01;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001
> 03/23/2005 10:01:01;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001
There are a couple of things this could be. First, how are your
$PBS_HOME/mom_priv/config files set up? They need $clienthost entries
for every host that will communicate with the mom from a privileged
port. That includes the pbs_server host, the host running your
scheduler (pbs_sched, maui, moab, etc.), and all the compute nodes.
You may need to add $restricted entries for all these hosts as well.
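As a rough sketch of what such a file might look like (the hostnames
here are placeholders I made up, not names from your cluster, so
substitute your own server, scheduler, and node names):

```
# $PBS_HOME/mom_priv/config on each compute node
# All hostnames below are hypothetical placeholders.
$clienthost  head.example.org
$clienthost  node01.example.org
$clienthost  node02.example.org
$restricted  head.example.org
```

The mom normally has to be restarted, or told to reread its config,
before changes to this file take effect.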
Second, do you have a scheduler daemon running? If not, that would
explain the "Could not contact Scheduler" messages in the pbs_server
logs.
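One quick way to check is to see whether anything is listening on the
scheduler port. The sketch below is only an illustration: it assumes
the scheduler runs on the same host you run the script on ("localhost"
is my assumption), and it uses port 15004 because that is the port
shown in your pbs_server log. The helper name port_is_open is made up
for this example.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and name lookup failures.
        return False


if __name__ == "__main__":
    # Assumption: the scheduler is expected on this machine, port 15004.
    host, port = "localhost", 15004
    state = "reachable" if port_is_open(host, port) else "not reachable"
    print(f"scheduler port {port} on {host} is {state}")
```

A successful connection only shows that something is listening on that
port, not that it is a working scheduler, so treat it as a first check.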
--Troy
--
Troy Baer email: troy at osc.edu
Science & Technology Support phone: 614-292-9701
Ohio Supercomputer Center web: http://oscinfo.osc.edu