[torqueusers] Serious torque failure problems

Garrick Staples garrick at usc.edu
Mon Aug 15 13:10:20 MDT 2005


On Fri, Aug 12, 2005 at 07:59:32AM +1000, David Singleton alleged:
> 
> Ummm, I don't think these are the issues in PBS UDP vs TCP.  The original
> PBS authors wrote RPP (Reliable Packet Protocol) on top of UDP. My
> belief is that they did this to get asynchronous messaging between
> daemons.  The RPP layer has acks, retries, etc. built in, but daemons do not
> block on RPP requests. Blocking TCP requests will hang for a TCP timeout
> period if the other end is not responding.  RPP also avoids issues
> with limits on large numbers of sockets, although that may be less
> of a problem now.

Just as a data point for everyone... I use RPP just fine with 1700+ nodes.
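
For anyone unfamiliar with the pattern described in the quoted text (acks and
retries layered over datagrams, with the sender never blocking), here is a
rough Python sketch. It is not the RPP code from PBS/TORQUE, and it does not
reflect RPP's wire format or timers. Every name in it (ReliableSender, send,
on_ack, poll, retry_interval, max_retries) is made up for illustration, and
the transmit callback stands in for whatever actually puts a datagram on the
wire.

```python
# Illustrative sketch only: NOT the actual RPP implementation in PBS/TORQUE.
# All class, method, and parameter names here are hypothetical.

class ReliableSender:
    """Tracks unacknowledged datagrams and retransmits them without blocking."""

    def __init__(self, transmit, retry_interval=2.0, max_retries=3):
        self.transmit = transmit            # callable(seq, payload); assumed to wrap a UDP send
        self.retry_interval = retry_interval
        self.max_retries = max_retries
        self.next_seq = 0
        self.pending = {}                   # seq -> [payload, retries_left, next_deadline]
        self.failed = []                    # sequence numbers we gave up on

    def send(self, payload, now):
        """Transmit once and return immediately; the ack is handled later."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = [payload, self.max_retries, now + self.retry_interval]
        self.transmit(seq, payload)
        return seq

    def on_ack(self, seq):
        """Called when an ack arrives; unknown or duplicate acks are ignored."""
        self.pending.pop(seq, None)

    def poll(self, now):
        """Called from the daemon's main loop: retransmit or give up on overdue packets."""
        for seq in list(self.pending):
            payload, retries_left, deadline = self.pending[seq]
            if now < deadline:
                continue                    # not overdue yet
            if retries_left == 0:
                del self.pending[seq]       # peer presumed unreachable
                self.failed.append(seq)
            else:
                self.pending[seq] = [payload, retries_left - 1, now + self.retry_interval]
                self.transmit(seq, payload)
```

The point of the design is that send() and poll() both return right away, so
a dead peer only costs a few retransmits and an entry in the failed list. A
blocking TCP request to the same peer would instead stall the caller until
the TCP timeout expires.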

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California