[torqueusers] Intermittent pbs_server connection problems upon upgrading

Nate Coraor nate at psu.edu
Mon Jul 26 09:09:02 MDT 2010


Ken Nielson wrote:
> On 07/26/2010 08:43 AM, Nate Coraor wrote:
>> Hi all,
>>
>> I've recently upgraded from 2.1.11 to 2.4.8 and since doing so, have
>> been experiencing a lot of delays in communication with pbs_server.
>> qstat often takes a bit (~5-10 seconds) to respond, and sometimes
>> doesn't at all (it looks like, if the response time is > 10 seconds),
>> failing with this error:
>>
>> pbs_iff: cannot connect to torque.example.org:15001 - timeout, errno=146
>> (Connection refused) cannot connect to port 1022 in client_to_svr -
>> connection refused
>> No Permission.
>> qstat: cannot connect to server torque.example.org (errno=15007)
>> Unauthorized Request
>>
>> Subsequent invocations of qstat succeed.  When this error is logged,
>> nothing interesting is happening in pbs_server, even if running with
>> loglevel 7, and the connection attempt is not logged at all.
>>
>> I haven't completely ruled out connection problems, but at the very
>> least, packets aren't dropping or taking long to move between the submit
>> host and the server.
>>
>> Is there an obvious place to start?
>>
>> Thanks,
>> --nate
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>    
> Nate,
> 
> How many nodes are in your cluster? Also, do you have job_stat_rate set 
> in your server parameters?
> 
> Ken

Hi Ken,

Currently 19 nodes, and yes it is:

Qmgr: list server
Server hostname.example.org
         server_state = Idle
         total_jobs = 18
         state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:18 Exiting:0
         acl_hosts = *.example.org,hostname.example.org
         managers = ...
         operators = ...
         default_queue = batch
         log_events = 511
         mail_from = adm
         scheduler_iteration = 600
         node_check_rate = 150
         tcp_timeout = 6
         job_stat_rate = 60
         poll_jobs = True
         pbs_version = 2.4.8
         next_job_number = 852224
         net_counter = 6 6 5
         server_name = torque.example.org

I lowered tcp_timeout and set both job_stat_rate and poll_jobs after the 
problem started; none of these changes has had any effect, for better or worse.
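
To help rule plain connection problems in or out, one option is to time
raw TCP connects from the submit host to the server port named in the
qstat error (torque.example.org:15001). The following is only a minimal
sketch, assuming Python 3 is available on the submit host; the function
names are hypothetical and not part of TORQUE, and it checks only TCP
connection setup, not pbs_iff authentication or the batch protocol.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # covers timeouts, refused connections and name-resolution errors
        return None


def probe(host, port, attempts=20, pause=1.0, timeout=10.0):
    """Attempt `attempts` connections, pausing between them; return the timings."""
    results = []
    for _ in range(attempts):
        results.append(time_connect(host, port, timeout))
        time.sleep(pause)
    return results


def summarize(results):
    """Count failed attempts and report the slowest successful connect."""
    ok = [r for r in results if r is not None]
    return {
        "attempts": len(results),
        "failed": len(results) - len(ok),
        "slowest": max(ok) if ok else None,
    }
```

Running summarize(probe("torque.example.org", 15001)) on the submit host
while qstat is misbehaving would show whether connection setup itself is
slow or failing; if every connect is fast, the delay is more likely
inside pbs_server or in the authentication step.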

Thanks,
--nate
