[torqueusers] 100+ job launch failures - 15009 errors.

Jay Srinivasan jay at nersc.gov
Wed Nov 7 14:19:02 MST 2007


Thanks for clarifying this, Garrick. In fact, we rebuilt torque with
RPP disabled because of that documentation statement, and it had no
effect on our problem, which was 15059 errors (not 15009).

So what might cause 15059 (cannot communicate with sister) messages on
large (300+ node) clusters then?

Jay

Garrick Staples wrote:
> On Tue, Nov 06, 2007 at 07:27:39PM -0500, Andrew J Caird alleged:
>> On Tue, 6 Nov 2007, Garrick Staples wrote:
>>
>>>> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server 
>>>> --enable-mom --disable-gui --with-server-home=/var/spool/PBS 
>>>> --with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp
>>> Don't use --disable-rpp.  It has no effect on your reported problem and 
>>> is just a bad idea.  I don't know where people keep getting this from.
>> Here:
>>   http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml
>>
>> "For large systems (in excess of 300 nodes) it is often valuable to build 
>> TORQUE using TCP for inner-daemon communication rather than the default of 
>> RPP (reliable packet protocol).  This can be accomplished using the 
>> '--disable-rpp' configure option.  TCP scales better and has improved 
>> fault tolerance in most cases.  In addition, in very large clusters (in 
>> excess of 1,000 nodes), it may be advisable to tune a number of 
>> communication layer timeouts."
>>
>> etc.
>>
>> Is that not true still?
> 
> The statement is false because --disable-rpp doesn't affect inter-mom
> communication or inter-server communication (the two forms of inter-daemon
> communication).  It only affects resource requests.
> 
> Back in the OpenPBS days, it was common for schedulers to do lots of resource
> requests directly to the MOMs.  One of the earliest TORQUE patches obsoleted
> that mechanism for schedulers.  Now momctl is the only program that issues
> resource requests.
> 
> In fact, the TCP resource requests have a bug in that sockets are never
> closed, so doing lots and lots of requests tends to run out of sockets.
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
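
To make the socket problem Garrick mentions a bit more concrete, here is a
rough Python sketch of the general failure mode. It is not TORQUE code and
does not talk to a MOM; the function names leaky_requests and clean_requests
are made up for illustration, and the sockets are only created, never
connected. It just shows that a socket that is never closed keeps its file
descriptor allocated, while closing after each request lets the descriptor
be reused.

```python
import socket


def leaky_requests(n):
    """Hypothetical stand-in for the buggy pattern: one TCP socket per
    request, never closed, so every descriptor stays allocated."""
    open_socks = []
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        open_socks.append(s)  # reference kept, descriptor never released
    return open_socks


def clean_requests(n):
    """Same loop, but each socket is closed when its request is done.
    Returns the set of descriptor numbers that were used."""
    fds = set()
    for _ in range(n):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            fds.add(s.fileno())
    return fds


leaked = leaky_requests(50)
leaked_fds = {s.fileno() for s in leaked}
reused_fds = clean_requests(50)

print(len(leaked_fds), "descriptors still held by the leaky pattern")
print(len(reused_fds), "distinct descriptors used by the clean pattern")

# Release the deliberately leaked sockets before exiting.
for s in leaked:
    s.close()
```

With enough requests the leaky pattern eventually hits the per-process
descriptor limit (ulimit -n), which is presumably what "run out of sockets"
looks like in practice.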


