[torqueusers] Problem with PBS server

Garrick Staples garrick at clusterresources.com
Tue Sep 19 13:22:18 MDT 2006


On Tue, Sep 19, 2006 at 03:48:08PM +0200, Guenter Bachler alleged:
> Dear all,
> 
> does anybody know what could have caused the 'Connection refused' 
> messages in the server_logs section below? Any job submitted to the 
> batch queue gets stuck in status 'Q' instead of 'R'.
> 
> Please note that
> - pbs_server, pbs_sched and pbs_mom are installed on the same computer 
> (sgsl001)
> - there is no firewall running on the system (OS = SLES 9)

Is pbs_sched actually running?  "Connection refused" on port 15004 means
nothing is listening there.  Verify that, and restart it if it is not.
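
One quick way to check this, independent of the PBS tools, is to try a plain TCP connection to the scheduler port reported in the log. The following is only a minimal sketch: probe_port is a made-up helper name, not part of TORQUE, and it assumes the scheduler listens on TCP port 15004 as the "Using ports" log line suggests.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a TCP connect and report 'open', 'refused', or 'error: ...'.

    probe_port is a hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # errno 111 (ECONNREFUSED) on Linux: nothing is listening on the port
        return "refused"
    except OSError as exc:
        # timeouts, name resolution failures, unreachable hosts, ...
        return f"error: {exc}"
```

Calling probe_port("sgsl001", 15004) on the server host should return "refused" while no scheduler is listening and "open" once one is, assuming the port has not been changed from the value in the log.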


 
> Section of the daily server_log:
> ...
> 09/19/2006 15:18:43;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
> 09/19/2006 15:18:43;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
> 09/19/2006 15:18:43;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs
> 09/19/2006 15:18:43;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001  Scheduler:15004  MOM:15002
> 09/19/2006 15:18:43;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 4129, loglevel=0
> 09/19/2006 15:18:43;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node sgsl001
> 09/19/2006 15:18:43;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;starting
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
> 09/19/2006 15:18:48;0040;PBS_Server;Req;update_node_state;adjusting state for node sgsl001 - state=514, newstate=2
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;sending ping to sgsl001 (new stream 0)
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;successful ping to node sgsl001 (stream 0)
> 09/19/2006 15:18:49;0040;PBS_Server;Req;do_rpp;rpp request received on stream 0
> 09/19/2006 15:18:49;0040;PBS_Server;Req;do_rpp;corrupt rpp request received on stream 0 (invalid protocol)
> 09/19/2006 15:18:49;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to sgsl001 is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (End of File).  setting node state to down
> ...
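
To pick the failure lines out of a longer server_log, the records can be split on their semicolons. This is a rough sketch, not a TORQUE tool: the field names in LogRecord are my own labels for the six semicolon-separated fields visible above, and filtering on event code 0001 is an assumption based on the two failure lines in this excerpt both carrying it. It assumes one record per line, so any records wrapped in transit need rejoining first.

```python
from typing import Iterable, Iterator, NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not taken from TORQUE itself
    timestamp: str
    event_code: str
    server: str
    obj_type: str
    obj_name: str
    message: str


def parse_line(line: str) -> LogRecord:
    """Split one server_log record into its six semicolon-separated fields."""
    # maxsplit=5 keeps any semicolons inside the message text intact
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"not a server_log record: {line!r}")
    return LogRecord(*parts)


def error_records(lines: Iterable[str], code: str = "0001") -> Iterator[LogRecord]:
    """Yield records with the given event code, skipping unparsable lines."""
    for line in lines:
        try:
            record = parse_line(line)
        except ValueError:
            continue  # blank lines, "..." markers, and similar
        if record.event_code == code:
            yield record
```

Run over the excerpt above, error_records should yield the "Connection refused (111) in contact_sched" record and the "stream_eof" record, which are the two lines worth chasing.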
> 
> Thanks in advance
> G. Bachler
> 
> -- 
> -----
> Dr. Günter Bachler
> Technical Expert System Administration CAE
> Product Development IT
> Corporate Functions / Information Technology
> 
> e-mail: guenter.bachler at avl.com
> Phone:  +43-316-787-3425
> Fax:    +43-316-787-1847
> 
> AVL LIST GMBH
> A-8020 Graz, Hans-List-Platz 1
> http://www.avl.com
> 

