[torqueusers] Server losing contact with pbs_mom at 102 nodes
Jones, Wesley
wesley_jones at nrel.gov
Fri Nov 12 16:27:27 MST 2004
I am now using patch4.
If I put 102 hosts in my server_priv file, things work great: just after
all of the pbs_mom daemons are started, pbsnodes -l (or -a) indicates that
all nodes are ready.
If I put 103 hosts in my server_priv file, pbsnodes -l indicates that node032
through node104 are not ready.
Here is a section of server_log. You can see it chugging along adding
new nodes, but when it gets to node031 it stops at
"sending cluster-addrs to node node031".
The momctl output for node032, node031, and node030 is also included.
Wes
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on
stream 28
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from
stream 28 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '4' received from
node030 (172.16.1.30:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;IS_STATUS received from
node030
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on
stream 28
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from
stream 28 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '1' received from
node030 (172.16.1.30:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;HELLO received from
node030
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;sending cluster-addrs to
node node030
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on
stream 29
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from
stream 29 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '4' received from
node031 (172.16.0.31:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;IS_STATUS received from
node031
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;rpp request received on
stream 29
11/12/2004 16:15:34;0040;PBS_Server;Req;do_rpp;inter-server request received
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message received from
stream 29 (version 1)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;message '1' received from
node031 (172.16.0.31:1023)
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;HELLO received from
node031
11/12/2004 16:15:34;0004;PBS_Server;Svr;is_request;sending cluster-addrs to
node node031
11/12/2004 16:15:45;0100;PBS_Server;Req;;Type disconnect request received
from root@head.atipacluster, sock=9
11/12/2004 16:15:45;0100;PBS_Server;Req;;Type statusqueue request received
from root@head.atipacluster, sock=9
11/12/2004 16:15:45;0100;PBS_Server;Req;;Type statusjob request received
from root@head.atipacluster, sock=9
11/12/2004 16:15:53;0040;PBS_Server;Req;do_rpp;rpp request received on
stream 26
11/12/2004 16:15:53;0040;PBS_Server;Req;do_rpp;inter-server request received
Here is the momctl output from node032, node031, and node030:
[root@node031 root]# ssh node032
Last login: Fri Nov 12 16:09:54 2004 from head.atipacluster
[root@node032 root]# momctl -d 3
Host: node032/node032 Server: 172.16.100.1 Version: torque_1.1.0p4
PID: 20600
HomeDirectory: /var/spool/PBS/mom_priv
MOM active: 36 seconds
WARNING: no messages received from server
Last Msg To Server: 6 seconds
WARNING: no hello/cluster-addrs messages received from server
Init Msgs Sent: 2 hellos
LOGLEVEL: 7 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
[root@node031 root]# momctl -d 3
Host: node031/node031 Server: 172.16.100.1 Version: torque_1.1.0p4
PID: 10749
HomeDirectory: /var/spool/PBS/mom_priv
MOM active: 29 seconds
WARNING: no messages received from server
Last Msg To Server: 0 seconds
WARNING: no hello/cluster-addrs messages received from server
Init Msgs Sent: 2 hellos
LOGLEVEL: 7 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
[root@node032 root]# ssh node030
Last login: Tue Oct 19 13:25:38 2004 from head.atipacluster
[root@node030 root]# momctl -d 2
Host: node030/node030 Server: 172.16.100.1 Version: torque_1.1.0p4
PID: 13279
HomeDirectory: /var/spool/PBS/mom_priv
MOM active: 52 seconds
WARNING: no messages received from server
Last Msg To Server: 0 seconds
WARNING: no hello/cluster-addrs messages received from server
Init Msgs Sent: 3 hellos
LOGLEVEL: 7 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
On 11/10/04 4:11 PM, "Jones, Wesley" <wesley_jones at nrel.gov> wrote:
> I am running torque-1.1.0p4-snap.1098376627.tar.gz built in 32-bit mode on
> an AMD64 system. Things work well when we use 102 or fewer nodes. When the
> nodes file has 103 nodes I get the error
>
> 11/10/2004 09:56:38;0001;PBS_Server;Svr;PBS_Server;Connection timed out
> (110) in stream_eof, connection to node002 dropped. setting node state to
> down in stream_eof
>
> in the server_log/<date> file, for different nodes at different times. I
> usually use pbsnodes -a to check what is available, and the number of free
> and down nodes keeps jumping around with more than 102 nodes. I am wondering
> if anyone has seen this behavior.
>
> Wes
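For tracking how the free/down counts jump around, one option is to tally the states from saved pbsnodes -a output. The sketch below assumes the usual layout of an unindented node name followed by indented "key = value" lines, one of which is "state = ...". The sample text is made up for illustration, not captured from this cluster, and count_states is a hypothetical helper name.

```python
from collections import Counter


def count_states(pbsnodes_output):
    """Count nodes per state in text captured from `pbsnodes -a`."""
    counts = Counter()
    for line in pbsnodes_output.splitlines():
        key, sep, value = line.strip().partition("=")
        # Only the "state = ..." attribute line of each node is counted.
        if sep and key.strip() == "state":
            counts[value.strip()] += 1
    return counts


# Hypothetical sample in the assumed layout (not real output).
sample_output = """\
node001
     state = free
     np = 2

node002
     state = down
     np = 2

node003
     state = free
     np = 2
"""

for state, n in sorted(count_states(sample_output).items()):
    print(state, n)
```

Running this against output saved every few seconds would show whether the same nodes keep flipping state or different ones each time.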