[torqueusers] Server losing contact with pbs_mom at 102 nodes
Jones, Wesley
wesley_jones at nrel.gov
Wed Nov 10 16:11:54 MST 2004
I am running torque-1.1.0p4-snap.1098376627.tar.gz built in 32-bit mode on
an AMD64 system. Things work well when we use 102 or fewer nodes. When the
nodes file has 103 nodes, I get the error
11/10/2004 09:56:38;0001;PBS_Server;Svr;PBS_Server;Connection timed out (110) in stream_eof, connection to node002 dropped. setting node state to down in stream_eof
in the server_log/<date> file, for different nodes at different times. I
usually use pbsnodes -a to check what is available, and with more than 102
nodes the counts of free and down nodes keep jumping around. I am wondering
if anyone has seen this behavior.
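
As a rough illustration of the check described above, the free/down tally
could be scripted along the following lines. This is only a sketch: it
assumes pbsnodes -a prints each node name followed by indented
"key = value" lines that include a "state = ..." entry, and the helper
names count_node_states and read_pbsnodes are made up for this example.

```python
import subprocess
from collections import Counter


def count_node_states(pbsnodes_output):
    """Tally node states from pbsnodes -a style text (format is assumed)."""
    counts = Counter()
    for line in pbsnodes_output.splitlines():
        key, sep, value = line.partition("=")
        # Only the "state = ..." attribute lines are of interest here.
        if sep and key.strip() == "state":
            # A node may list several comma-separated states, e.g. "down,offline".
            for state in value.strip().split(","):
                counts[state.strip()] += 1
    return counts


def read_pbsnodes():
    """Hypothetical helper: run pbsnodes -a and return its text output."""
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling count_node_states(read_pbsnodes()) every few seconds and logging
the result would show how often nodes flip between free and down.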
Wes
--
Wesley B. Jones
Sr. Computational Scientist
National Renewable Energy Lab
303-275-4070