[torqueusers] 100+ job launch failures - 15009 errors.

Andrew J Caird acaird at umich.edu
Tue Nov 6 17:27:39 MST 2007


On Tue, 6 Nov 2007, Garrick Staples wrote:

>> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server 
>> --enable-mom --disable-gui --with-server-home=/var/spool/PBS 
>> --with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp
>
> Don't use --disable-rpp.  It has no effect on your reported problem and 
> is just a bad idea.  I don't know where people keep getting this from.

Here:
   http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml

"For large systems (in excess of 300 nodes) it is often valuable to build 
TORQUE using TCP for inner-daemon communication rather than the default of 
RPP (reliable packet protocol).  This can be accomplished using the 
'--disable-rpp' configure option.  TCP scales better and has improved 
fault tolerance in most cases.  In addition, in very large clusters (in 
excess of 1,000 nodes), it may be advisable to tune a number of 
communication layer timeouts."

etc.

Is that no longer true?

--andy

