[torqueusers] Disappearing Nodes

Troy Baer troy at osc.edu
Wed Mar 23 11:29:27 MST 2005


On Wed, 2005-03-23 at 10:04, Jeremy Stout wrote:
> Hello. Over the weekend, I noticed that the nodes on my cluster would
> disappear and come back every few minutes. When they would disappear,
> the status would often appear as "down". I've looked at the server/
> mom logs and have not been able to figure out what the problem is. I
> wiped out my existing torque installation and reinstalled everything
> this morning (as per the Torque Quick Installation Guide).
> Unfortunately, I am still getting the same error messages. Any help
> would be appreciated.
>
> Here is a brief summary of the error messages I'm seeing:
> pbs_server:
> 03/23/2005 09:41:01;0086;PBS_Server;Svr;PBS_Server;Starting to
> shutdown the server, type is By Signal
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;Log;Log opened
> 03/23/2005 09:41:06;0006;PBS_Server;Svr;PBS_Server;Server energy
> started, initialization type = 1
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;Act;Account file
> /usr/spool/PBS/server_priv/accounting/20050323 opened
> 03/23/2005 09:41:06;0040;PBS_Server;Req;setup_nodes;setup_nodes()
> 03/23/2005 09:41:06;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Expected 1,
> recovered 1 queues
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 03/23/2005 09:41:06;0006;PBS_Server;Svr;PBS_Server;Using ports
> Server:15001  Scheduler:15004  MOM:15002
> 03/23/2005 09:41:06;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 13231
> 03/23/2005 09:41:06;0001;PBS_Server;Svr;PBS_Server;Connection refused
> (111) in contact_sched, Could not contact Scheduler - port 15004
> 03/23/2005 09:51:01;0086;PBS_Server;Svr;PBS_Server;Starting to
> shutdown the server, type is By Signal
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;Log;Log opened
> 03/23/2005 09:51:05;0006;PBS_Server;Svr;PBS_Server;Server energy
> started, initialization type = 1
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;Act;Account file
> /usr/spool/PBS/server_priv/accounting/20050323 opened
> 03/23/2005 09:51:05;0040;PBS_Server;Req;setup_nodes;setup_nodes()
> 
> 03/23/2005 09:51:05;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Expected 1,
> recovered 1 queues
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 03/23/2005 09:51:05;0006;PBS_Server;Svr;PBS_Server;Using ports
> Server:15001  Scheduler:15004  MOM:15002
> 03/23/2005 09:51:05;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 13439
> 03/23/2005 09:51:05;0001;PBS_Server;Svr;PBS_Server;Connection refused
> (111) in contact_sched, Could not contact Scheduler - port 15004
> 
> pbs_mom:
> 03/23/2005 09:26:01;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
> 03/23/2005 09:26:01;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
> 03/23/2005 09:31:00;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001
> 03/23/2005 09:41:01;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001
> 03/23/2005 09:49:33;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end
> of message from addr 10.10.10.1:15001
> 03/23/2005 09:51:01;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001
> 03/23/2005 10:01:01;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 10.10.10.1:15001

There are a couple of things this could be.  First off, how do you have your
$PBS_HOME/mom_priv/config files set up?  You need to have $clienthost
entries in it for any host which will communicate with the mom from a
privileged port -- which includes the pbs_server host, the host running
your scheduler (pbs_sched, maui, moab, etc.), and all the compute
nodes.  You may need to add $restricted entries for all these hosts as
well.
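
As a rough illustration of what such a config might contain, here is a small Python sketch that prints one $clienthost line (and, optionally, one $restricted line) per host.  The host names and the render_mom_config helper are made up for the example; substitute your own server, scheduler, and compute node names, and check the directives against the documentation for your TORQUE version before copying the output into $PBS_HOME/mom_priv/config.

```python
def render_mom_config(client_hosts, restricted_hosts=()):
    """Return mom_priv/config text with one directive per host.

    Hypothetical helper for illustration; it only formats text and
    does not touch any TORQUE files.
    """
    lines = [f"$clienthost {host}" for host in client_hosts]
    lines += [f"$restricted {host}" for host in restricted_hosts]
    return "\n".join(lines) + "\n"


# Placeholder host names (assumptions): the pbs_server/scheduler host
# plus two compute nodes.
hosts = ["head.example.com", "node01.example.com", "node02.example.com"]

config_text = render_mom_config(hosts, restricted_hosts=hosts)
print(config_text, end="")
```

The same file would normally go on every compute node, and pbs_mom would need to be restarted (or told to reread its config) for the changes to take effect.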

Second, do you have a scheduler daemon running?  If not, that would
explain the "Could not contact Scheduler" messages in the pbs_server
logs.
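
One quick way to check is to see whether anything is accepting TCP connections on the scheduler port.  The sketch below uses only Python's standard socket module; port 15004 comes from the "Using ports" line in your log, while "localhost" is an assumption that the scheduler runs on the same machine as pbs_server.  The port_is_open name is just for this example.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: scheduler on the local host, port taken from the log.
    if port_is_open("localhost", 15004):
        print("something is listening on port 15004")
    else:
        print("nothing is listening on port 15004")
```

If nothing is listening, the "Connection refused (111)" lines are what you would expect to see, and starting pbs_sched (or maui, moab, etc.) should make them go away.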

	--Troy
-- 
Troy Baer                       email:  troy at osc.edu
Science & Technology Support    phone:  614-292-9701
Ohio Supercomputer Center       web:  http://oscinfo.osc.edu




