[torqueusers] PBS_Server just stop responding
knielson at adaptivecomputing.com
Thu Jun 14 08:30:40 MDT 2012
On Wed, Jun 13, 2012 at 9:41 PM, Ian Miller <ianm at uchicago.edu> wrote:
> Hi All,
> I have a 34-node cluster running CentOS 6 with torque 2.5.7 and maui 3.3.1.
> When a user submits a job to a node and it takes up pretty much all of the
> resources on the server, I've noticed that qsub and qstat will stop
> responding. My fix is to restart pbs_server. My question: is this a
> config on the mom side that needs to be changed, or is this a pbs_server-side
> config that needs to be looked at? Users will submit jobs that from time
> to time will kill a node, but the rest of the cluster should not suffer.
What else is happening on your system? For example, how many jobs are in
the queue? Do you have a user calling qstat over and over? This combination
on 2.5 can cause the server to appear hung because it is single-threaded
and all of its time is taken up by the qstat calls.
I would look at other things along these lines as well.
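
As a rough way to check both of those things, here is a minimal Python sketch. It is not part of TORQUE; the function names are made up for illustration. It assumes a Linux `ps` that accepts `-o user= -o comm=`, and it assumes the default `qstat` listing, where the job state is the second-to-last column and job lines start with a numeric job id. Adjust the parsing if your output looks different.

```python
import subprocess
from collections import Counter


def count_qstat_callers(ps_output: str) -> Counter:
    """Count running qstat processes per user.

    Expects the output of `ps -e -o user= -o comm=` (one "user command" per line).
    """
    callers = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == "qstat":
            callers[fields[0]] += 1
    return callers


def count_jobs_by_state(qstat_output: str) -> Counter:
    """Count jobs per state (R, Q, ...) from a default qstat listing.

    Assumption: job lines start with a digit (the job id) and the state
    is the second-to-last column. Header and separator lines are skipped.
    """
    states = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0][0].isdigit():
            states[fields[-2]] += 1
    return states


if __name__ == "__main__":
    ps = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    print("qstat processes per user:", dict(count_qstat_callers(ps.stdout)))

    try:
        # Use a timeout so this script does not hang along with the server.
        qs = subprocess.run(["qstat"], capture_output=True, text=True, timeout=30)
        print("jobs by state:", dict(count_jobs_by_state(qs.stdout)))
    except subprocess.TimeoutExpired:
        print("qstat did not answer within 30 s; pbs_server may be busy")
```

If one user shows up with many qstat processes while the queue holds a large number of jobs, that would fit the pattern described above.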