[torqueusers] 100+ job launch failures - 15009 errors.
Andrew J Caird
acaird at umich.edu
Tue Nov 6 17:27:39 MST 2007
On Tue, 6 Nov 2007, Garrick Staples wrote:
>> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server
>> --enable-mom --disable-gui --with-server-home=/var/spool/PBS
>> --with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp
>
> Don't use --disable-rpp. It has no effect on your reported problem and
> is just a bad idea. I don't know where people keep getting this from.
Here:
http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml
"For large systems (in excess of 300 nodes) it is often valuable to build
TORQUE using TCP for inner-daemon communication rather than the default of
RPP (reliable packet protocol). This can be accomplished using the
'--disable-rpp' configure option. TCP scales better and has improved
fault tolerance in most cases. In addition, in very large clusters (in
excess of 1,000 nodes, it may be advisable to tune a number of
communication layer timeouts."
etc.
Is that not true still?
--andy