[torqueusers] Problem with PBS server
Garrick Staples
garrick at clusterresources.com
Tue Sep 19 13:22:18 MDT 2006
On Tue, Sep 19, 2006 at 03:48:08PM +0200, Guenter Bachler alleged:
> Dear all,
>
> does anybody know what could have caused the 'Connection refused'
> messages shown in the server_log section below? Any job submitted to
> the batch queue stays stuck in state 'Q' instead of moving to 'R'.
>
> Please note that
> - pbs_server, pbs_sched and pbs_mom are installed on the same computer
> (sgsl001)
> - there is no firewall running on the system (OS = SLES 9)
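Is pbs_sched actually running? 'Connection refused' on port 15004
usually means nothing is listening there. Check for the process, and
restart it if it is missing.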
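If it helps, here is a rough sketch of a quick check from the server host. It only tries to open a TCP connection to the scheduler port that your log reports (15004). The helper name port_is_listening and the use of "localhost" are my own assumptions for illustration, and this is not part of TORQUE itself.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout, or lookup failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (errno 111 on Linux) is a subclass of OSError.
        return False


if __name__ == "__main__":
    # Assumption: the scheduler runs on this host and uses the port
    # shown in the server_log ("Scheduler:15004").
    host, port = "localhost", 15004
    if port_is_listening(host, port):
        print(f"{host}:{port} is accepting connections")
    else:
        print(f"{host}:{port} is not reachable; is pbs_sched running?")
```

A False result only says that nothing accepted the connection; it does not say why, so the scheduler's own log is still the place to look next.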
> Section of the daily server_log:
> ...
> 09/19/2006 15:18:43;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
> 09/19/2006 15:18:43;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered
> 1 queues
> 09/19/2006 15:18:43;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered
> 0 jobs
> 09/19/2006 15:18:43;0006;PBS_Server;Svr;PBS_Server;Using ports
> Server:15001 Scheduler:15004 MOM:15002
> 09/19/2006 15:18:43;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid =
> 4129, loglevel=0
> 09/19/2006 15:18:43;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact
> node sgsl001
> 09/19/2006 15:18:43;0001;PBS_Server;Svr;PBS_Server;Connection refused
> (111) in contact_sched, Could not contact Scheduler - port 15004
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;starting
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;ping attempting to
> contact 1 nodes
>
> 09/19/2006 15:18:48;0040;PBS_Server;Req;update_node_state;adjusting
> state for node sgsl001 - state=514, newstate=2
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;sending ping to
> sgsl001 (new stream 0)
> 09/19/2006 15:18:48;0040;PBS_Server;Req;ping_nodes;successful ping to
> node sgsl001 (stream 0)
> 09/19/2006 15:18:49;0040;PBS_Server;Req;do_rpp;rpp request received on
> stream 0
>
> 09/19/2006 15:18:49;0040;PBS_Server;Req;do_rpp;corrupt rpp request
> received on stream 0 (invalid protocol)
>
> 09/19/2006 15:18:49;0001;PBS_Server;Svr;PBS_Server;stream_eof,
> connection to sgsl001 is bad, remote service may be down, message may be
> corrupt, or connection may have been dropped remotely (End of File).
> setting node state to down
> ...
>
> Thanks in advance
> G. Bachler
>
> --
> -----
> Dr. Günter Bachler
> Technical Expert System Administration CAE
> Product Development IT
> Corporate Functions / Information Technology
>
> e-mail: guenter.bachler at avl.com
> Phone: +43-316-787-3425
> Fax: +43-316-787-1847
>
> AVL LIST GMBH
> A-8020 Graz, Hans-List-Platz 1
> http://www.avl.com
>