[torqueusers] Intermittent pbs_server connection problems upon upgrading
nate at psu.edu
Mon Jul 26 10:15:41 MDT 2010
Garrick Staples wrote:
> Obvious place to start is to strace pbs_server and see if it is hanging on anything.
Ah yes, I did do this. It wasn't sitting on anything that I could see.
> But I don't think the problem is with pbs_server because pbs_iff is returning with Connection refused. I'm pretty sure that error is occurring before anything gets to pbs_server.
I suspected load on pbs_server, but perhaps not. Are any resources
shared on the client? This happens more prominently on the submit host,
which is running a pretty heavily loaded application that submits and
monitors PBS jobs through pbs_python.
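
To check whether connections are refused or delayed at the TCP level, independent of pbs_iff and qstat, a small probe along these lines could be run on the submit host. This is only a sketch: the function names are made up for illustration, and the host and port in the usage note are simply the ones from the error message quoted below. It also connects from an ordinary ephemeral port, so it would not reproduce a problem tied to the local port pbs_iff uses (the error mentions port 1022).

```python
import socket
import time


def probe_connect(host, port, timeout=10.0):
    """Attempt one TCP connect; return (elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    error = None
    try:
        # The connection is closed again as soon as it has been established.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:  # covers timeouts, refusals and DNS failures
        error = exc
    return time.monotonic() - start, error


def summarize(results, slow_threshold=5.0):
    """Count failed and slow attempts in a list of probe_connect results."""
    failed = sum(1 for _, err in results if err is not None)
    slow = sum(1 for t, err in results if err is None and t >= slow_threshold)
    return {"attempts": len(results), "failed": failed, "slow": slow}


def run(host, port, attempts=60, interval=1.0):
    """Probe repeatedly, print one line per attempt, return the summary."""
    results = []
    for _ in range(attempts):
        elapsed, error = probe_connect(host, port)
        results.append((elapsed, error))
        status = error if error is not None else "ok"
        print(f"{time.strftime('%H:%M:%S')} {elapsed:6.3f}s {status}")
        time.sleep(interval)
    return summarize(results)
```

Calling `run("torque.example.org", 15001)` while the submitting application is busy would show whether plain connects ever fail or stall. If they never do, that would point away from the network path and toward something on the client side.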
> On Jul 26, 2010, at 7:43 AM, Nate Coraor wrote:
>> Hi all,
>> I've recently upgraded from 2.1.11 to 2.4.8 and since doing so, have
>> been experiencing a lot of delays in communication with pbs_server.
>> qstat often takes a while (~5-10 seconds) to respond, and sometimes
>> doesn't respond at all (apparently when the response time exceeds 10
>> seconds), failing with this error:
>> pbs_iff: cannot connect to torque.example.org:15001 - timeout, errno=146
>> (Connection refused) cannot connect to port 1022 in client_to_svr -
>> connection refused
>> No Permission.
>> qstat: cannot connect to server torque.example.org (errno=15007)
>> Unauthorized Request
>> Subsequent invocations of qstat succeed. When this error is logged,
>> nothing interesting is happening in pbs_server, even if running with
>> loglevel 7, and the connection attempt is not logged at all.
>> I haven't completely ruled out connection problems, but at the very
>> least, packets aren't dropping or taking long to move between the submit
>> host and the server.
>> Is there an obvious place to start?
>> torqueusers mailing list
>> torqueusers at supercluster.org