[torqueusers] PBS_Server just stop responding
dbeer at adaptivecomputing.com
Thu Jun 14 09:56:07 MDT 2012
How did you configure TORQUE? Did you use --with-tcp-retry-limit=? I
suggest using 5 there. pbs_server can get stuck retrying different ports
for a very long time (over 4.5 hours) because it will retry about 880
different ports to contact a certain node, and sometimes it gets stuck. If
you set this limit, you make it so that it doesn't retry more than the
number of times that you specify.
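To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It uses only the figures above (about 880 ports, a stall of over 4.5 hours) and assumes, purely for illustration, that every attempt costs about the same time. The function names are made up for this example and are not part of TORQUE.

```python
# Figures quoted in this message; both are approximate.
PORTS_RETRIED = 880          # ports pbs_server may try for one node
OBSERVED_STALL_HOURS = 4.5   # "over 4.5 hours", treated as a lower bound


def seconds_per_attempt(stall_hours: float = OBSERVED_STALL_HOURS,
                        attempts: int = PORTS_RETRIED) -> float:
    """Average time per attempt, assuming all attempts cost the same."""
    return stall_hours * 3600 / attempts


def estimated_stall_seconds(retry_limit: int) -> float:
    """Rough worst-case stall for one node if retries stop at retry_limit."""
    return retry_limit * seconds_per_attempt()


if __name__ == "__main__":
    print(f"per attempt: about {seconds_per_attempt():.1f} s")
    print(f"limit of 5:  about {estimated_stall_seconds(5):.0f} s")
```

Under that assumption, a limit of 5 would bring the worst case for a single unreachable node down from hours to roughly a minute and a half. Actual timings depend on the network and on how the node fails.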
On Thu, Jun 14, 2012 at 8:30 AM, Ken Nielson <knielson at adaptivecomputing.com> wrote:
> On Wed, Jun 13, 2012 at 9:41 PM, Ian Miller <ianm at uchicago.edu> wrote:
>> Hi All,
>> I have a 34-node cluster running CentOS 6 with torque 2.5.7 and maui 3.3.1.
>> When a user submits a job to a node and it takes up pretty much all of
>> the resources on the server, I've noticed that qsub and qstat will stop
>> responding. My fix is to restart the pbs_server. My question: is this a
>> config on the mom side that needs to be changed, or is this a pbs_server-side
>> config that needs to be looked at? Users will submit jobs that from time
>> to time will kill a node, but the rest of the cluster should not suffer.
> What else is happening on your system? For example, how many jobs are in
> the queue? Do you have a user calling qstat over and over? This combination
> on 2.5 can cause the server to appear hung because it is single-threaded
> and all of its time is taken up by the qstat calls.
> I would look at other things along this line as well.
David Beer | Software Engineer