[torqueusers] Server losing contact with pbs_mom at 102 nodes

Jones, Wesley wesley_jones at nrel.gov
Wed Nov 10 16:11:54 MST 2004


I am running torque-1.1.0p4-snap.1098376627.tar.gz built in 32-bit mode on
an AMD64 system.  Things work well when we use 102 or fewer nodes.  When the
nodes file has 103 nodes I get the error

11/10/2004 09:56:38;0001;PBS_Server;Svr;PBS_Server;Connection timed out (110) in stream_eof, connection to node002 dropped.  setting node state to down in stream_eof

in the server_log/<date> file, for different nodes at different times.  I
usually use pbsnodes -a to check what is available, and with more than 102
nodes the number of free and down nodes keeps jumping around.  I am
wondering if anyone has seen this behavior.
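
To put numbers on the jumping, the free/down counts can be tallied from
the pbsnodes -a text.  The following is only a rough sketch: it assumes
each node block contains an indented "state = ..." line, and the function
name and the sample text are made up for illustration (the sample is not
output from this cluster).

```python
from collections import Counter


def count_node_states(pbsnodes_output: str) -> Counter:
    """Tally every 'state = ...' line found in pbsnodes -a style text."""
    counts = Counter()
    for line in pbsnodes_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "state":
            # Assumption: several states may be listed, comma-separated.
            for state in value.split(","):
                counts[state.strip()] += 1
    return counts


# Hypothetical sample text, for illustration only.
SAMPLE = """\
node001
     state = free
     np = 2

node002
     state = down
     np = 2
"""

if __name__ == "__main__":
    for state, n in sorted(count_node_states(SAMPLE).items()):
        print(state, n)
```

Running this against the live pbsnodes -a output every few seconds and
logging the counts with a timestamp would show how often nodes flip
between free and down.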

Wes

-- 
Wesley B. Jones
Sr. Computational Scientist
National Renewable Energy Lab
303-275-4070





