[torquedev] "Premature end of message dropped", large cluster, ip-sec, nodes down

cm42 at gmx.de
Thu Jan 26 06:36:52 MST 2006

Dear Torque developers, 
many thanks in advance for your help. I hope this is the correct list 
for this posting; if not, please feel free to move it to the users list. 
Here is the problem description: 
***** Cluster and system description: 
* 20 external IP nodes connected to the server using ipsec transport mode 
* 260 internal IP nodes without any encryption or firewall 
* All 280 nodes configured as "normal" space shared (not time-shared) 
* Torque server has two interfaces, one internal and one external 
* OS: Suse 10.0 Kernel 2.6.13 
* Hardware: Most nodes fast P4, some slower Athlon XP, Server fast P4 
* Network: Internal GBit, external 100MBit 
* torque-2.0.0p6-snap.200601242349 (same problem with older versions) 
* Scheduler: Maui (but not important here) 
***** Problem: 
Just a few seconds after Torque starts, all external ipsec- 
connected nodes appear as down in 'pbsnodes -a' 
(and no jobs start on these nodes). 
According to the log files (log level 9 on server and moms), the external 
moms _receive_ _and_ _return_ the HELLO ping: 
--- begin log 
01/26/2006 13:46:48;0004;PBS_Server;Svr;is_request;message HELLO (1) 
received from mom on host wega (ip.of.node.wega:1023) 
01/26/2006 13:46:48;0004;PBS_Server;Svr;is_request;HELLO received from 
01/26/2006 13:46:48;0004;PBS_Server;Svr;is_request;sending cluster-addrs 
to node wega 
--- end log 
+ Message repeated every 10 seconds in the log file, 
+ but with a gap of 8 minutes after each Premature message (see below). 
+ That means the HELLO message appears 19 times 
+ in the 3 minutes before the next Premature message. 
+ The Premature message ends the HELLOs (for 8 minutes). 
But the external nodes do not receive any 
further commands from the server; 
e.g., the next message in the server's log is: 
--- begin log 
01/26/2006 13:46:56;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection 
to Premature end of message dropped (wega).  setting node state 
 to down 
--- end log 
+ Message repeated every 11 mins 10 secs in the log file. 
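In case anyone wants to check the timing themselves, here is a minimal Python sketch that measures the gap between successive "Premature end of message" events in a PBS server log (the sample lines below are illustrative, modeled on the format quoted above, not real log data):

```python
from datetime import datetime

def premature_intervals(log_text):
    """Return the gaps (in seconds) between successive Premature events."""
    stamps = []
    for line in log_text.splitlines():
        if "Premature end of message" in line:
            # PBS server log lines start with "MM/DD/YYYY HH:MM:SS;..."
            ts = line.split(";", 1)[0]
            stamps.append(datetime.strptime(ts, "%m/%d/%Y %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# Illustrative sample only:
sample = """\
01/26/2006 13:46:56;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to Premature end of message dropped (wega).  setting node state to down
01/26/2006 13:58:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to Premature end of message dropped (wega).  setting node state to down
"""
print(premature_intervals(sample))  # -> [670.0], i.e. 11 min 10 s
```

The arithmetic is consistent with the report: 19 HELLOs at 10-second spacing cover about 3 minutes, and adding the 8-minute gap gives the 11 min 10 s period between Premature messages.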
***** BUT (it is not ip-sec, it is not the network, ...): 
If we remove 220 of the 260 internal nodes and start the server again, 
all remaining 60 nodes (20 ext + 40 int) are free. (So there is no 
fundamental problem with ipsec, the firewall, network timing, the 
operating system, or whatever.) 20 ext + 80 int nodes is still OK, 
but starting with about 120 int + 20 ext nodes, the external nodes are 
down again. 
If we put all 260 internal nodes back into the queue together with the 
20 external nodes, but now shut down the ipsec subsystem, 
then again all nodes are free (including the external ones). 
So I have no idea what the problem could be. 
***** Torque configuration and further remarks: 
* --disable-rpp (same problem with RPP enabled) 
* Tried increasing various timing parameters without effect, 
  e.g. src/lib/Libifl/tcp_dis.c: time_t pbs_tcp_timeout = 40 
  Note that the down problem appears 3 seconds after 
  Torque starts, so timeouts of 60 seconds should not be the issue. 
* "set server tcp_timeout = 40" 
* All nodes that are reported as free 
  also show their load, uname, totmem, etc. in the "status =" line of 
  "pbsnodes -a". The external 'down' nodes show no "status =" at all. 
* And yes, we have tried nearly everything suggested in 
  "F. Large Cluster Considerations" of the Torque manual. 
Many thanks for any help. 
