[torqueusers] 100+ job launch failures - 15009 errors.
garrick at usc.edu
Tue Nov 6 17:32:24 MST 2007
On Tue, Nov 06, 2007 at 07:27:39PM -0500, Andrew J Caird alleged:
> On Tue, 6 Nov 2007, Garrick Staples wrote:
> >>./configure --prefix=/usr/local/pbs --enable-docs --enable-server
> >>--enable-mom --disable-gui --with-server-home=/var/spool/PBS
> >>--with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp
> >Don't use --disable-rpp. It has no effect on your reported problem and
> >is just a bad idea. I don't know where people keep getting this from.
> "For large systems (in excess of 300 nodes) it is often valuable to build
> TORQUE using TCP for inter-daemon communication rather than the default of
> RPP (reliable packet protocol). This can be accomplished using the
> '--disable-rpp' configure option. TCP scales better and has improved
> fault tolerance in most cases. In addition, in very large clusters (in
> excess of 1,000 nodes), it may be advisable to tune a number of
> communication layer timeouts."
> Is that not true still?
The statement is false because --disable-rpp doesn't affect inter-mom
communication or inter-server communication (the two forms of inter-daemon
communication). It only affects resource requests.
Back in the OpenPBS days, it was common for schedulers to do lots of resource
requests directly to the MOMs. One of the earliest TORQUE patches obsoleted
that mechanism for schedulers. Now momctl is the only program that issues
resource requests.
In fact, the TCP resource requests have a bug in that sockets are never closed,
so doing lots and lots of requests tends to run out of sockets.
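
To illustrate that failure mode in general terms, here is a minimal Python
sketch. It is not TORQUE code: issue_requests is a hypothetical helper that
opens one local socket pair per simulated request and reports how many
sockets are still open afterwards. When sockets are never closed, the count
grows with every request until the process hits its descriptor limit.

```python
import socket


def issue_requests(count, close_after_use):
    """Simulate `count` requests, each over its own local socket pair.

    Returns the number of sockets still open once all requests are done.
    This is an illustrative stand-in, not how TORQUE issues requests.
    """
    opened = []
    for _ in range(count):
        client, server = socket.socketpair()
        client.sendall(b"ping")   # the "request"
        server.recv(4)            # the peer reads it
        if close_after_use:
            client.close()
            server.close()
        opened.extend([client, server])

    # A closed socket reports a file descriptor of -1.
    still_open = sum(1 for s in opened if s.fileno() != -1)

    # Clean up so the demonstration itself does not leak descriptors.
    for s in opened:
        s.close()
    return still_open


if __name__ == "__main__":
    print("never closed:", issue_requests(50, close_after_use=False))
    print("closed after use:", issue_requests(50, close_after_use=True))
```

With 50 simulated requests, the never-closed case leaves 100 sockets open and
the closed-after-use case leaves none. A long-running daemon in the first
situation eventually cannot open any new connections.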