[torqueusers] 100+ job launch failures - 15009 errors.
Andrew J Caird
acaird at umich.edu
Wed Nov 7 06:44:42 MST 2007
On Tue, 6 Nov 2007, Garrick Staples wrote:
> On Tue, Nov 06, 2007 at 07:27:39PM -0500, Andrew J Caird alleged:
>> "For large systems (in excess of 300 nodes) it is often valuable to
>> build TORQUE using TCP for inner-daemon communication rather than the
>> default of RPP (reliable packet protocol). This can be accomplished
>> using the '--disable-rpp' configure option."
>> Is that not true still?
> The statement is false because --disable-rpp doesn't affect inter-mom
> communication or inter-server communication (the two forms of
> inter-daemon communication). It only affects resource requests.
> Back in the OpenPBS days, it was common for schedulers to do lots of
> resource requests directly to the MOMs. One of the earliest TORQUE
> patches obsoleted that mechanism for schedulers. Now momctl is the only
> program that issues resource requests.
> In fact, the TCP resource requests have a bug in that sockets are never
> closed, so doing lots and lots of requests tends to run out of sockets.
Can someone with Wiki access update the documentation to this effect?