[torqueusers] Intermittent pbs_server connection problems upon upgrading
Nate Coraor
nate at psu.edu
Mon Jul 26 09:09:02 MDT 2010
Ken Nielson wrote:
> On 07/26/2010 08:43 AM, Nate Coraor wrote:
>> Hi all,
>>
>> I've recently upgraded from 2.1.11 to 2.4.8 and since doing so, have
>> been experiencing a lot of delays in communication with pbs_server.
>> qstat often takes a bit (~5-10 seconds) to respond, and sometimes
>> doesn't at all (it looks like, if the response time is > 10 seconds),
>> failing with this error:
>>
>> pbs_iff: cannot connect to torque.example.org:15001 - timeout, errno=146
>> (Connection refused) cannot connect to port 1022 in client_to_svr -
>> connection refused
>> No Permission.
>> qstat: cannot connect to server torque.example.org (errno=15007)
>> Unauthorized Request
>>
>> Subsequent invocations of qstat succeed. When this error is logged,
>> nothing interesting is happening in pbs_server, even if running with
>> loglevel 7, and the connection attempt is not logged at all.
>>
>> I haven't completely ruled out connection problems, but at the very
>> least, packets aren't dropping or taking long to move between the submit
>> host and the server.
>>
>> Is there an obvious place to start?
>>
>> Thanks,
>> --nate
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
> Nate,
>
> How many nodes are in your cluster? Also, do you have job_stat_rate set
> in your server parameters?
>
> Ken
Hi Ken,
Currently 19 nodes, and yes it is:
Qmgr: list server
Server hostname.example.org
    server_state = Idle
    total_jobs = 18
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:18 Exiting:0
    acl_hosts = *.example.org,hostname.example.org
    managers = ...
    operators = ...
    default_queue = batch
    log_events = 511
    mail_from = adm
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    job_stat_rate = 60
    poll_jobs = True
    pbs_version = 2.4.8
    next_job_number = 852224
    net_counter = 6 6 5
    server_name = torque.example.org
tcp_timeout was lowered, and both job_stat_rate and poll_jobs were set,
after this started; none of them has had any effect, for better or worse.
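To separate plain network trouble from pbs_server being slow to accept
connections, something like the rough sketch below could be run from the
submit host. It assumes Python 3 is available there; time_connect, probe
and report are hypothetical helper names, and the host and port are simply
the ones from the pbs_iff error quoted above. It only times the TCP
connect, so it says nothing about the pbs_iff authentication step itself.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Time one TCP connect attempt.

    Returns (seconds_elapsed, error); error is None on success,
    otherwise the OSError that was raised (refused, timed out, DNS).
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and opens the TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start, None
    except OSError as exc:
        return time.monotonic() - start, exc


def probe(host, port, attempts=20, pause=1.0, timeout=10.0):
    """Run several connect attempts and return a list of (elapsed, error)."""
    results = []
    for i in range(attempts):
        results.append(time_connect(host, port, timeout))
        if i + 1 < attempts:
            time.sleep(pause)  # space the attempts out a little
    return results


def report(host, port, **kwargs):
    """Print one line per attempt: elapsed time and outcome."""
    for elapsed, err in probe(host, port, **kwargs):
        status = "ok" if err is None else "failed: {}".format(err)
        print("{:7.3f}s  {}".format(elapsed, status))
```

Calling report("torque.example.org", 15001) prints one line per attempt.
If the slow or failed attempts show up here as well, the delay is at or
below the TCP level rather than in qstat or pbs_iff.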
Thanks,
--nate