[torqueusers] Intermittent pbs_server connection problems upon upgrading
knielson at adaptivecomputing.com
Mon Jul 26 08:56:22 MDT 2010
On 07/26/2010 08:43 AM, Nate Coraor wrote:
> Hi all,
> I've recently upgraded from 2.1.11 to 2.4.8 and since doing so, have
> been experiencing a lot of delays in communication with pbs_server.
> qstat often takes a bit (~5-10 seconds) to respond, and sometimes
> doesn't at all (it looks like, if the response time is > 10 seconds),
> failing with this error:
> pbs_iff: cannot connect to torque.example.org:15001 - timeout, errno=146
> (Connection refused) cannot connect to port 1022 in client_to_svr -
> connection refused
> No Permission.
> qstat: cannot connect to server torque.example.org (errno=15007)
> Unauthorized Request
> Subsequent invocations of qstat succeed. When this error is logged,
> nothing interesting is happening in pbs_server, even if running with
> loglevel 7, and the connection attempt is not logged at all.
> I haven't completely ruled out connection problems, but at the very
> least, packets aren't dropping or taking long to move between the submit
> host and the server.
> Is there an obvious place to start?
How many nodes are in your cluster? Also, do you have job_stat_rate set
in your server parameters?
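
For anyone wanting to check the second point quickly, here is a minimal
sketch that pulls server attributes out of the output of
qmgr -c 'print server'. It assumes the output uses lines of the form
"set server <name> = <value>"; the sample text and the helper names
(parse_server_attrs, read_server_attrs) are made up for illustration and
are not part of TORQUE.

```python
import re
import subprocess


def parse_server_attrs(text):
    """Collect 'set server <name> = <value>' lines into a dict.

    Assumes the qmgr 'print server' line format described above.
    List-style lines using '+=' are deliberately ignored here.
    """
    attrs = {}
    for line in text.splitlines():
        m = re.match(r"\s*set server (\S+) = (.*)$", line)
        if m:
            attrs[m.group(1)] = m.group(2).strip()
    return attrs


def read_server_attrs():
    """Run qmgr and parse its output.

    Needs qmgr on PATH and permission to query pbs_server.
    """
    result = subprocess.run(
        ["qmgr", "-c", "print server"],
        capture_output=True, text=True, check=True,
    )
    return parse_server_attrs(result.stdout)


# Illustrative sample only; values are invented, not from a real server.
SAMPLE = """\
#
# Set server attributes.
#
set server scheduling = True
set server log_events = 511
set server job_stat_rate = 300
"""

if __name__ == "__main__":
    attrs = parse_server_attrs(SAMPLE)
    # Report the value if present, otherwise say it is not set.
    print("job_stat_rate:", attrs.get("job_stat_rate", "not set"))
```

On a real system, call read_server_attrs() instead of parsing SAMPLE; if
job_stat_rate is missing from the result, it has not been set explicitly.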