[torqueusers] Intermittent pbs_server connection problems upon upgrading

Nate Coraor nate at psu.edu
Thu Jul 29 09:48:14 MDT 2010


Nate Coraor wrote:
> Garrick Staples wrote:
>> Obvious place to start is to strace pbs_server and see if it is hanging on anything.
> 
> Hi Garrick,
> 
> Ah yes, I did do this.  It wasn't hanging on anything that I could see.
> 
>> But I don't think the problem is with pbs_server because pbs_iff is returning with Connection refused. I'm pretty sure that error is occurring before anything gets to pbs_server.
> 
> I suspected load in the pbs_server, but perhaps not.  Are resources 
> shared on the client?  This happens more prominently on the submit host, 
> which is running a pretty heavily loaded application that submits and 
> monitors PBS jobs through pbs_python.

Hi Garrick,

It looks like you're correct.  I downgraded only the submit host to 2.1.11 
and the problem immediately vanished.  I didn't even restart the client 
application, and the server and exec hosts are still running 2.4.8.  Does 
this provide any insight as to what's going on?
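
In case it helps anyone reproduce this, here is a minimal sketch (not 
something from the TORQUE or pbs_python code) for checking whether plain 
TCP connects from the submit host to the pbs_server port are ever refused 
or slow.  The function names probe_connect and summarize are my own, and 
the host and port in the usage comment are the ones from the error message 
quoted below.  One caveat: this connects from an ordinary ephemeral source 
port, whereas the "port 1022 in client_to_svr" part of the error suggests 
pbs_iff connects from a reserved port, so a clean result here would not 
rule out a problem specific to that path.

```python
import socket
import time


def probe_connect(host, port, timeout=10.0):
    """Attempt one TCP connect; return (outcome, seconds elapsed)."""
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        # No answer within `timeout` seconds.
        outcome = "timeout"
    except ConnectionRefusedError:
        # The remote end (or something in between) actively refused.
        outcome = "refused"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or unreachable host.
        outcome = "error: %s" % exc
    else:
        sock.close()
        outcome = "ok"
    return outcome, time.monotonic() - start


def summarize(results):
    """Count each outcome and report the slowest attempt in seconds."""
    counts = {}
    slowest = 0.0
    for outcome, elapsed in results:
        counts[outcome] = counts.get(outcome, 0) + 1
        slowest = max(slowest, elapsed)
    return counts, slowest


# Example use (host and port taken from the quoted error message):
#   results = [probe_connect("torque.example.org", 15001) for _ in range(100)]
#   print(summarize(results))
```

Running the commented-out loop on the submit host while the client 
application is busy gives a count of each outcome and the slowest connect 
time, which should show whether refusals also happen outside of pbs_iff.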

Thanks,
--nate

> 
> Thanks,
> --nate
> 
>>
>> On Jul 26, 2010, at 7:43 AM, Nate Coraor wrote:
>>
>>> Hi all,
>>>
>>> I've recently upgraded from 2.1.11 to 2.4.8 and since doing so, have 
>>> been experiencing a lot of delays in communication with pbs_server. 
>>> qstat often takes 5-10 seconds to respond, and sometimes doesn't 
>>> respond at all (apparently when the response time exceeds 10 seconds), 
>>> failing with this error:
>>>
>>> pbs_iff: cannot connect to torque.example.org:15001 - timeout, errno=146 
>>> (Connection refused) cannot connect to port 1022 in client_to_svr - 
>>> connection refused
>>> No Permission.
>>> qstat: cannot connect to server torque.example.org (errno=15007) 
>>> Unauthorized Request
>>>
>>> Subsequent invocations of qstat succeed.  When this error is logged, 
>>> nothing interesting is happening in pbs_server, even if running with 
>>> loglevel 7, and the connection attempt is not logged at all.
>>>
>>> I haven't completely ruled out connection problems, but at the very 
>>> least, packets aren't dropping or taking long to move between the submit 
>>> host and the server.
>>>
>>> Is there an obvious place to start?
>>>
>>> Thanks,
>>> --nate
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers

