[torqueusers] Serious torque failure problems

David Singleton David.Singleton at anu.edu.au
Thu Aug 11 15:59:32 MDT 2005

Ummm, I don't think these are the issues in PBS UDP vs TCP.  The original
PBS authors wrote RPP (Reliable Packet Protocol) on top of UDP.  My
belief is that they did this to get asynchronous messaging between
daemons.  The RPP layer has acks, retries, etc. built in, but daemons do not
block on RPP requests.  Blocking TCP requests will hang for a TCP timeout
period if the other end is not responding.  RPP also avoids issues
with limits on large numbers of sockets, although that may be less
of a problem now.
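
To illustrate the idea (acks and retries layered on UDP, with the daemon
never blocking on a peer), here is a minimal sketch in Python. It is NOT
the real RPP code or wire format: the class name AsyncSender, the
"DATA"/"ACK" message layout, and the retry constants are all made up for
illustration.

```python
import socket
import time

RETRY_INTERVAL = 0.5   # seconds between resends (illustrative value)
MAX_RETRIES = 3        # resends before giving up (illustrative value)


class AsyncSender:
    """Toy ack/retry layer over UDP; hypothetical, not the RPP wire format."""

    def __init__(self, peer):
        self.peer = peer
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.setblocking(False)   # recvfrom() never waits for the peer
        self.next_seq = 0
        self.pending = {}              # seq -> [payload, last_sent, retries]
        self.failed = []               # seqs that ran out of retries

    def send(self, payload, now=None):
        """Transmit a packet and return at once; the ack is collected later."""
        now = time.monotonic() if now is None else now
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = [payload, now, 0]
        self._transmit(seq)
        return seq

    def _transmit(self, seq):
        payload = self.pending[seq][0]
        self.sock.sendto(b"DATA %d " % seq + payload, self.peer)

    def poll(self, now=None):
        """Call from the daemon's main loop: read acks, resend stale packets."""
        now = time.monotonic() if now is None else now
        while True:
            try:
                data, _ = self.sock.recvfrom(2048)
            except BlockingIOError:
                break                  # nothing waiting; carry on with other work
            parts = data.split()
            if len(parts) == 2 and parts[0] == b"ACK" and parts[1].isdigit():
                self.pending.pop(int(parts[1]), None)
        for seq, entry in list(self.pending.items()):
            if now - entry[1] >= RETRY_INTERVAL:
                if entry[2] >= MAX_RETRIES:
                    del self.pending[seq]      # peer presumed down
                    self.failed.append(seq)
                else:
                    entry[1] = now
                    entry[2] += 1
                    self._transmit(seq)
```

The point of the sketch is that send() and poll() both return immediately
whether or not the other end is alive, so one dead node only costs a few
resends. A daemon doing a blocking TCP connect or read to the same node
would instead sit there until the TCP timeout expired.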


Stewart Samuels wrote:
> By default, torque/pbs use rpp (udp).  I have been thinking about
> disabling rpp which then forces torque to use tcp instead of udp.  I am
> curious about how many torque sites do this?  Since udp does NOT
> guarantee packet transfer and tcp does, I would think tcp should be the
> default.  The obvious drawback is that because udp does not have the
> overhead tcp does, higher performance can be achieved.  So, it seems a
> trade-off between communication success and performance.  On the other
> hand, if packet communication is not robust, what good is performance.
> How many sites disable rpp?  And why have you elected to do so?
> 	Stewart
> On Thu, 2005-08-11 at 11:24, Marc Langlois wrote:
>>On Thu, 2005-08-11 at 07:52, Chris Johnson wrote:
>>>     Hi,
>>>     Running torque-1.2.0p2 here on CentOS 4.X across a cluster composed 
>>>of AMDs, P4s, XEONs, and Opterons, several hundred total.
>>>     At the moment torque is becoming useless.  We are using the
>>>out-of-the-box C scheduler and it keeps dying completely.  We are seeing jobs
>>>running on nodes which they are not listed in any log as officially having
>>>been run on.  The scheduler gets hung up on problem nodes and tries to
>>>run many many jobs on the bad node.  The scheduler/server can't pick up
>>>on the fact that a node is bad due to a situation in which mom seems to 
>>>respond but other linux services have failed.  And my boss is about ready
>>>to trash torque.  Can't blame him, we're spending way too much time
>>>maintaining this cluster.  Researchers aren't any too thrilled either.  
>>>     As far as I know torque is used in a lot of places.  And I don't hear
>>>about these problems other places.  What the hell is going on?  I REALLY
>>>need to get this corrected.  And I'll provide any information I can.  
>>>But my community is about ready to start telling people what not to use for 
>>>cluster operations.  
>>>     Help would be GREATLY appreciated.  
>>Hi Chris,
>>I had a similar experience with PBS a few years ago. Although it could
>>be a bit dated, I found that the default C scheduler worked fine for
>>testing, but as soon as I rolled it into production, it failed
>>miserably. Switching to the Maui scheduler solved all my problems.
>>Another thing that helped was using the "--disable-rpp" flag when
>>running configure for torque. It seems that RPP was flooding the network
>>with UDP traffic that hung the PBS server and scheduler (we were running
>>on Solaris).
>>As far as finding an alternative system, I recently gave SGE a try to see
>>how it compared to torque. Could be that I'm familiar with the PBS way,
>>but SGE had its own set of quirks that I couldn't get around, so I
>>dropped it and came back to torque.
>>Hope this helps. Good luck!
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

    Dr David Singleton               ANU Supercomputer Facility
    HPC Systems Manager              and APAC National Facility
    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
    Phone: +61 2 6125 4389           Australian National University
    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
