[torqueusers] 100+ job launch failures - 15009 errors.

Garrick Staples garrick at usc.edu
Wed Nov 7 10:43:28 MST 2007


On Wed, Nov 07, 2007 at 08:44:42AM -0500, Andrew J Caird alleged:
> On Tue, 6 Nov 2007, Garrick Staples wrote:
> 
> >On Tue, Nov 06, 2007 at 07:27:39PM -0500, Andrew J Caird alleged:
> >
> >>"For large systems (in excess of 300 nodes) it is often valuable to 
> >>build TORQUE using TCP for inner-daemon communication rather than the 
> >>default of RPP (reliable packet protocol).  This can be accomplished 
> >>using the '--disable-rpp' configure option."
> >>
> >>etc.
> >>
> >>Is that not true still?
> >
> >The statement is false because --disable-rpp doesn't affect inter-mom 
> >communication or inter-server communication (the two forms of 
> >inter-daemon communication).  It only affects resource requests.
> >
> >Back in the OpenPBS days, it was common for schedulers to do lots of 
> >resource requests directly to the MOMs.  One of the earliest TORQUE 
> >patches obsoleted that mechanism for schedulers.  Now momctl is the only 
> >program that issues resource requests.
> >
> >In fact, the TCP resource requests have a bug: sockets are never 
> >closed, so issuing lots and lots of requests tends to run out of sockets.
> 
> Can someone with Wiki access update the documentation to this effect?

The referenced URL, /torquedocs21/a.flargeclusters.shtml, isn't in the wiki.

CRI peeps have to change it.
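
For anyone wondering why never-closed sockets matter, here is a rough
sketch of the failure mode. It is not TORQUE code: the function names are
made up for illustration, and socket.socketpair() stands in for a real
TCP connection to a MOM. The first function leaves every socket open, so
each request pins descriptors until the process hits its limit; the
second closes them as soon as the request finishes.

```python
import socket


def leaky_requests(n):
    """Issue n requests and never close the sockets (the buggy pattern)."""
    held = []
    for _ in range(n):
        client, server = socket.socketpair()  # stand-in for a TCP connection
        client.sendall(b"status")
        server.recv(16)
        held.append((client, server))  # still open: descriptors pile up
    return held


def clean_requests(n):
    """Issue n requests, closing each socket once the request is done."""
    still_open = 0
    for _ in range(n):
        client, server = socket.socketpair()
        with client, server:  # both sockets are closed on leaving the block
            client.sendall(b"status")
            server.recv(16)
        # fileno() returns -1 for a closed socket
        still_open += sum(s.fileno() != -1 for s in (client, server))
    return still_open
```

With the leaky version, the number of open descriptors grows with the
number of requests; with the clean version it stays flat no matter how
many requests are made.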
