[torqueusers] torque 4.2.2 communication error

Peter A Ruprecht peter.ruprecht at Colorado.EDU
Thu Jun 6 12:59:25 MDT 2013


In case anyone else runs into this, here's what I found out from the
Adaptive developers (as well as I can understand it; please correct me if
I'm wrong).

Torque 4 contacts all the nodes in the node list when pbs_server starts
up.  If a lot of nodes are down at the time, these contact attempts seem
to take a long time to time out, and pbs_server is unresponsive to
queries until they do.

A workaround is to start pbs_server with the -c flag, which prevents it
from contacting all the nodes at startup.  However, after 10 minutes
pbs_server will start trying to contact the nodes, and if a lot of them
are still down it will (in my experience) become unresponsive again.
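
Since the stall seems to depend on how many nodes are down, it may help to
count unreachable nodes before starting pbs_server.  The following is only a
rough sketch, not anything from the Adaptive developers.  It assumes the
nodes file has one node per line with the hostname as the first token, that
pbs_mom listens on TCP port 15002 (the usual default; adjust for your site),
and that a failed TCP connect is a fair stand-in for "down".  The function
names are made up for illustration.

```python
import socket

MOM_PORT = 15002  # assumption: default pbs_mom port; change if your site differs


def parse_node_names(nodes_text):
    """Return the first token of each non-blank line of a nodes file.

    Lines starting with '#' are skipped as a defensive assumption.
    """
    names = []
    for line in nodes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.append(line.split()[0])
    return names


def tcp_probe(host, port=MOM_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def unreachable_nodes(names, probe=tcp_probe):
    """Return the names for which probe(name) is False, in input order."""
    return [name for name in names if not probe(name)]
```

To use it, read your nodes file into a string, pass it to
parse_node_names(), and hand the result to unreachable_nodes().  If the
returned list is long, starting pbs_server with -c (or fixing or removing
the dead nodes first) is probably the safer route.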

 -Pete

On 6/5/13 1:23 PM, "Peter A Ruprecht" <peter.ruprecht at Colorado.EDU> wrote:

>Hi,
>
>I am trying to get torque 4.2.2 working on our cluster but it doesn't seem
>to be accepting connections from its utilities, even when these are run on
>the server itself.  For example:
>
>moab# pbsnodes -a cnode0104
>parse_daemon_response error 15033 Batch protocol error
>parse_daemon_response error 15033 Batch protocol error
>parse_daemon_response error 15033 Batch protocol error
>parse_daemon_response error 15033 Batch protocol error
>parse_daemon_response error 15033 Batch protocol error
>parse_daemon_response error 15033 Batch protocol error
>Error communicating with moab.rc.colorado.edu(10.128.0.132)
>Communication failure.
>pbsnodes: cannot connect to server moab.rc.colorado.edu, error=15096
>(Error getting connection to socket)
>
>Similarly:
>
>moab# qstat -a
>socket_read_num error
>parse_daemon_response error 15033 Batch protocol error
>parse_daemon_response error 15033 Batch protocol error
>. . .
>
>
>
>I'm not seeing any obvious problems in the system message logs.  (Server
>is RHEL6, 64-bit.)
>
>iptables and selinux are off.  This server had been running 2.5.11 just
>fine before.
>
>Any suggestions about what else I should be looking for?
>
>Thanks,
>Pete Ruprecht
>University of Colorado Boulder