[torqueusers] 100+ job launch failures - 15009 errors.

Andrew J Caird acaird at umich.edu
Wed Nov 7 06:44:42 MST 2007


On Tue, 6 Nov 2007, Garrick Staples wrote:

> On Tue, Nov 06, 2007 at 07:27:39PM -0500, Andrew J Caird alleged:
>
>> "For large systems (in excess of 300 nodes) it is often valuable to 
>> build TORQUE using TCP for inner-daemon communication rather than the 
>> default of RPP (reliable packet protocol).  This can be accomplished 
>> using the '--disable-rpp' configure option."
>>
>> etc.
>>
>> Is that not true still?
>
> The statement is false because --disable-rpp doesn't affect inter-mom 
> communication or inter-server communication (the two forms of 
> inter-daemon communication).  It only affects resource requests.
>
> Back in the OpenPBS days, it was common for schedulers to do lots of 
> resource requests directly to the MOMs.  One of the earliest TORQUE 
> patches obsoleted that mechanism for schedulers.  Now momctl is the only 
> program that issues resource requests.
>
> In fact, the TCP resource requests have a bug in that sockets are never 
> closed, so doing lots and lots of requests tends to run out of sockets.

Can someone with Wiki access update the documentation to this effect?

Thanks.
--andy

