[torqueusers] 100+ job launch failures - 15009 errors.
garrick at usc.edu
Wed Nov 7 10:43:28 MST 2007
On Wed, Nov 07, 2007 at 08:44:42AM -0500, Andrew J Caird alleged:
> On Tue, 6 Nov 2007, Garrick Staples wrote:
> >On Tue, Nov 06, 2007 at 07:27:39PM -0500, Andrew J Caird alleged:
> >>"For large systems (in excess of 300 nodes) it is often valuable to
> >>build TORQUE using TCP for inner-daemon communication rather than the
> >>default of RPP (reliable packet protocol). This can be accomplished
> >>using the '--disable-rpp' configure option."
> >>Is that not true still?
> >The statement is false because --disable-rpp doesn't affect inter-mom
> >communication or inter-server communication (the two forms of
> >inter-daemon communication). It only affects resource requests.
> >Back in the OpenPBS days, it was common for schedulers to do lots of
> >resource requests directly to the MOMs. One of the earliest TORQUE
> >patches obsoleted that mechanism for schedulers. Now momctl is the only
> >program that issues resource requests.
> >In fact, the TCP resource requests have a bug in that sockets are never
> >closed, so doing lots and lots of requests tends to run out of sockets.
> Can someone with Wiki access update the documentation to this effect?
The referenced URL, /torquedocs21/a.flargeclusters.shtml, isn't in the wiki.
CRI peeps have to change it.
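
To illustrate the "run out of sockets" failure mode described in the quoted text, here is a minimal Python sketch. It is not TORQUE code and does not talk to pbs_mom; the function names, the descriptor limit of 64, and the request count of 200 are all assumptions chosen for the demonstration. It only shows the general effect on a POSIX system: a process that opens one TCP socket per request and never closes it eventually hits its file-descriptor limit, while one that closes each socket does not.

```python
import errno
import resource
import socket


def issue_requests(n, close_after_use):
    """Open one TCP socket per simulated request; return how many succeeded.

    Hypothetical helper for illustration only. No connection is made; creating
    the socket is enough to consume a file descriptor.
    """
    leaked = []  # keep references so unclosed sockets are not garbage-collected
    done = 0
    try:
        for _ in range(n):
            try:
                s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            except OSError as exc:
                if exc.errno in (errno.EMFILE, errno.ENFILE):
                    break  # the process has run out of descriptors
                raise
            done += 1
            if close_after_use:
                s.close()         # well-behaved client
            else:
                leaked.append(s)  # simulates the "never closed" bug
    finally:
        for s in leaked:          # clean up so the demo leaves nothing behind
            s.close()
    return done


def demo(limit=64, requests=200):
    """Lower the soft descriptor limit, run both variants, restore the limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        new_soft = limit
    else:
        new_soft = min(limit, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    try:
        leaky = issue_requests(requests, close_after_use=False)
        clean = issue_requests(requests, close_after_use=True)
        return leaky, clean
    finally:
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    leaky, clean = demo()
    print(f"requests completed without close(): {leaky}")
    print(f"requests completed with close():    {clean}")
```

With the limit lowered, the leaking variant stops well short of the requested count, while the closing variant completes every request. The exact number reached by the leaking variant depends on how many descriptors the process already has open.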