[torqueusers] Server losing contact with pbs_mom at 102 nodes

Jones, Wesley wesley_jones at nrel.gov
Fri Nov 12 16:27:27 MST 2004


I am now using patch4.

If I put 102 hosts in my server_priv nodes file, things work great, and just
after all of the pbs_moms are started, pbsnodes -l (-a) indicates that all
nodes are ready.

If I put 103 hosts in my server_priv nodes file, pbsnodes -l indicates that
node032 through node104 are not ready.

Here is a section of server_log.  You can see it is chugging along adding
new nodes, but when it gets to node031 it stops at
"sending cluster-addrs to node node031"

momctl output for node032, node031, and node030 is also included.
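
For anyone wanting to find the stall point without reading the log by eye, here is a minimal Python sketch. It assumes each server_log record sits on one line in the semicolon-separated layout seen in the excerpt (date time;code;daemon;class;object;message). The function name handshake_progress and the sample lines are illustrative only and are not part of TORQUE.

```python
import re

# Message patterns copied from the server_log excerpt in this post.
HELLO_RE = re.compile(r"HELLO received from\s+(\S+)")
ADDRS_RE = re.compile(r"sending cluster-addrs to\s+node\s+(\S+)")

def handshake_progress(lines):
    """Return (hello_nodes, addrs_nodes), each in order of first appearance."""
    hello, addrs = [], []
    for line in lines:
        # Assumed layout: date time;code;daemon;class;object;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a complete log record
        message = fields[5]
        m = HELLO_RE.search(message)
        if m and m.group(1) not in hello:
            hello.append(m.group(1))
        m = ADDRS_RE.search(message)
        if m and m.group(1) not in addrs:
            addrs.append(m.group(1))
    return hello, addrs

# Hypothetical sample built from records shown in this post.
sample = [
    "11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;HELLO received from node030",
    "11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;sending cluster-addrs to node node030",
    "11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;HELLO received from node031",
    "11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;sending cluster-addrs to node node031",
    "11/12/2004 16:15:45;0100;PBS_Server;Req;;Type statusjob request received from root at head.atipacluster, sock=9",
]

hello, addrs = handshake_progress(sample)
print("last HELLO from:", hello[-1])
print("last cluster-addrs sent to:", addrs[-1])
```

Run against the real log (for example with open(path) as the lines argument), the last entry in the second list would show the final node the server tried to send cluster-addrs to.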

Wes

11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on stream 28
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from stream 28 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '4' received from node030 (172.16.1.30:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;IS_STATUS received from node030
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on stream 28
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from stream 28 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '1' received from node030 (172.16.1.30:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;HELLO received from node030
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;sending cluster-addrs to node node030
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on stream 29
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from stream 29 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '4' received from node031 (172.16.0.31:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;IS_STATUS received from node031
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on stream 29
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from stream 29 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '1' received from node031 (172.16.0.31:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;HELLO received from node031
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;sending cluster-addrs to node node031
11/12/2004 16:15:45;0100;PBS_Server;Req;;Type disconnect request received from root at head.atipacluster, sock=9
11/12/2004 16:15:45;0100;PBS_Server;Req;;Type statusqueue request received from root at head.atipacluster, sock=9
11/12/2004 16:15:45;0100;PBS_Server;Req;;Type statusjob request received from root at head.atipacluster, sock=9
11/12/2004 16:15:53;0040;PBS_Server;Req;do_rpp;rpp request received on stream 26
11/12/2004 16:15:53;0040;PBS_Server;Req;do_rpp;inter-server request received




Here is the momctl output from node032, node031, and node030:

[root at node031 root]# ssh node032
Last login: Fri Nov 12 16:09:54 2004 from head.atipacluster
[root at node032 root]# momctl -d 3

Host: node032/node032   Server: 172.16.100.1   Version: torque_1.1.0p4
PID:                    20600
HomeDirectory:          /var/spool/PBS/mom_priv
MOM active:             36 seconds
WARNING:  no messages received from server
Last Msg To Server:     6 seconds
WARNING:  no hello/cluster-addrs messages received from server
Init Msgs Sent:         2 hellos
LOGLEVEL:               7 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

[root at node031 root]# momctl -d 3

Host: node031/node031   Server: 172.16.100.1   Version: torque_1.1.0p4
PID:                    10749
HomeDirectory:          /var/spool/PBS/mom_priv
MOM active:             29 seconds
WARNING:  no messages received from server
Last Msg To Server:     0 seconds
WARNING:  no hello/cluster-addrs messages received from server
Init Msgs Sent:         2 hellos
LOGLEVEL:               7 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete


[root at node032 root]# ssh node030
Last login: Tue Oct 19 13:25:38 2004 from head.atipacluster
[root at node030 root]# momctl -d 2

Host: node030/node030   Server: 172.16.100.1   Version: torque_1.1.0p4
PID:                    13279
HomeDirectory:          /var/spool/PBS/mom_priv
MOM active:             52 seconds
WARNING:  no messages received from server
Last Msg To Server:     0 seconds
WARNING:  no hello/cluster-addrs messages received from server
Init Msgs Sent:         3 hellos
LOGLEVEL:               7 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete
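
When checking many nodes, the WARNING lines in this momctl output can be pulled out with a short script. The following is only a sketch in Python that works on the text layout shown above. The function name parse_momctl is hypothetical, and the sample is trimmed from the node032 output in this post.

```python
def parse_momctl(text):
    """Split momctl diagnostic text into 'Key: value' fields and WARNING lines."""
    fields, warnings = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("WARNING:"):
            # Keep only the warning text itself.
            warnings.append(line[len("WARNING:"):].strip())
        elif ":" in line:
            # Split on the first colon only; a line that packs several
            # fields (like the Host/Server/Version line) stays as one value.
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields, warnings

# Sample trimmed from the node032 output quoted in this post.
sample = """\
PID:                    20600
MOM active:             36 seconds
WARNING:  no messages received from server
Last Msg To Server:     6 seconds
WARNING:  no hello/cluster-addrs messages received from server
Init Msgs Sent:         2 hellos

diagnostics complete
"""

fields, warnings = parse_momctl(sample)
for w in warnings:
    print("warning:", w)
print("hellos sent:", fields["Init Msgs Sent"])
```

In the three outputs above, every node reports both warnings, including node030 and node031, for which the server log shows "sending cluster-addrs".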


On 11/10/04 4:11 PM, "Jones, Wesley" <wesley_jones at nrel.gov> wrote:

> I am running torque-1.1.0p4-snap.1098376627.tar.gz built in 32-bit mode on
> an AMD64 system.  Things work well when we use 102 or fewer nodes.  When the
> nodes file has 103 nodes I get the error
> 
> 11/10/2004 09:56:38;0001;PBS_Server;Svr;PBS_Server;Connection timed out
> (110) in stream_eof, connection to node002 dropped.  setting node state to
> down in stream_eof
> 
> in the server_log/<date> file, for different nodes at different times.  I
> usually use pbsnodes -a to check what is available, and with more than 102
> nodes the number of free and down nodes just jumps around.  I am wondering
> if anyone has seen this behavior.
> 
> Wes


