[torqueusers] Torque behaving badly

Jonathan Barber jonathan.barber at gmail.com
Tue Nov 19 01:48:38 MST 2013


On 19 November 2013 02:06, Jagga Soorma <jagga13 at gmail.com> wrote:

> I was able to resolve my intermittent connection issues by setting the
> following kernel tunables on the client:
>
> sysctl -w net.ipv4.tcp_timestamps=1
> sysctl -w net.ipv4.tcp_tw_recycle=1
>
> However, there is only 1 server and 1 client in this torque test
> environment.  So, I still don't understand why setting the fast recycle of
> sockets that are in a time_wait state would help or be needed in this
> case.  I might be masking the real problem.
>

Strange.

With net.ipv4.tcp_tw_recycle disabled (set to 0), do you actually see many
sockets in state TIME_WAIT with netstat when you hit the problem?
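
Something along these lines would give a per-state count. It is only a
sketch: the function name count_tcp_states is made up for illustration, and
it assumes the usual Linux "netstat -tan" layout where the state is the
sixth column.

```shell
# Count TCP sockets per state from `netstat -tan` output read on stdin.
# Header lines are skipped because their first field does not start with "tcp".
# count_tcp_states is a hypothetical helper name, not part of any tool.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { count[$6]++ } END { for (s in count) print s, count[s] }' | sort
}

# Typical use on the client while the problem is occurring:
#   netstat -tan | count_tcp_states
```

A large TIME_WAIT count at the moment the connects fail would point at
socket exhaustion; a small one would suggest the tunables are masking
something else.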

Perhaps you could run tcpdump/wireshark on the server and the client and
examine the TCP streams for errors and to make sure that the client is
receiving everything that is sent.
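
To keep the capture small, the filter can be limited to the mom port. A
rough sketch, assuming the mom is on port 15002 as in your log; the helper
name mom_filter, the interface "any" and the output file name are
placeholders of mine, so adjust them to your setup.

```shell
# Build a pcap filter matching only traffic to or from one host on one
# TCP port. mom_filter is a hypothetical helper; host and port are arguments.
mom_filter() {
    printf 'host %s and tcp port %s' "$1" "$2"
}

# Typical use, run as root on both the server and the client
# (interface "any" and the file name are placeholders):
#   tcpdump -n -i any -w /tmp/mom.pcap "$(mom_filter 72.34.135.64 15002)"
```

In the resulting trace, a SYN that is answered with a RST means the
connection was refused, while a SYN with no reply at all means it was
silently dropped somewhere along the way.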


> Has anyone run into this issue before?
>
> Thanks,
> -J
>
>
> On Mon, Nov 18, 2013 at 3:50 PM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>> The "cannot connect" message looks suspiciously like it could be a
>> firewall problem.
>>
>> Regards
>>
>>
>> On Fri, Nov 15, 2013 at 1:11 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> So, this is a brand new install of torque without anything running on
>>> the server/client except the torque processes.  I checked and I don't think
>>> the server is running into any process limits.
>>>
>>> I set up the server and sched processes on the client itself and am now
>>> running everything on the client host to rule out external components.  I
>>> still see the same failure connecting to port 15002.  I had a 1Gig copper
>>> connection on this server as well and migrated my network to a completely
>>> different NIC, and that did not help either.
>>>
>>> This is really a bizarre one that I can't seem to find the cause for.
>>>  Any other things you guys think might help me troubleshoot this problem?
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Fri, Nov 15, 2013 at 4:05 AM, Jonathan Barber <
>>> jonathan.barber at gmail.com> wrote:
>>>
>>>> On 15 November 2013 03:18, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>
>>>>> I changed the log level and here is what I see on the server:
>>>>>
>>>>> Looks like it is intermittently having issues connecting to port 15002
>>>>> on the client.  This client was just fine under the 2.5.9 torque production
>>>>> environment that we have but seems to be intermittently having issues in
>>>>> the 2.5.13 test environment that is set up with GPU support.
>>>>>
>>>>> [snip]
>>>>
>>>>>
>>>>> 11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
>>>>> setting job 7352.server1.xxx.com state from QUEUED-QUEUED to
>>>>> RUNNING-PRERUN (4-40)
>>>>> 11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;forking
>>>>> in send_job
>>>>>
>>>>> *11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;attempting
>>>>> connect to host 72.34.135.64 port 15002
>>>>> 11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;cannot connect
>>>>> to host port 15002 - cannot establish connection () - time=0 seconds*
>>>>>
>>>>> *11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;attempting
>>>>> connect to host 72.34.135.64 port 15002
>>>>> 11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;cannot connect
>>>>> to host port 15002 - cannot establish connection () - time=0 seconds*
>>>>> 11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;entering
>>>>> post_sendmom
>>>>>
>>>>
>>>> You might be running up against limits on the number of file
>>>> descriptors the pbs_server process or the OS is allowed to have open. You
>>>> can use tools such as lsof to see how many files the pbs_server has open:
>>>> $ sudo lsof -c pbs_server
>>>>
>>>> It's also possible that you're running out of ports to bind to. Running
>>>> lsof/netstat and looking to see if there are massive numbers of
>>>> connections/files open will reveal this.
>>>>
>>>> Although you say there is no firewall configured on the servers, do you
>>>> know if there is a firewall between the pbs_server and the nodes?
>>>>
>>>> You can do a simple TCP connect to the mom to see if it's listening:
>>>> $ nmap -p 15002 ava01.grid.fe.up.pt -oG -
>>>> # Nmap 6.40 scan initiated Fri Nov 15 11:52:17 2013 as: nmap -p 15002
>>>> -oG - ava01.grid.fe.up.pt
>>>> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Status: Up
>>>> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Ports:
>>>> 15002/open/tcp//unknown///
>>>> # Nmap done at Fri Nov 15 11:52:17 2013 -- 1 IP address (1 host up)
>>>> scanned in 0.04 seconds
>>>> $
>>>>
>>>> Or continuously with hping3 (I'm sure there are other tools that will
>>>> do this as well):
>>>> $ sudo hping3 -S -p 15002 ava01.grid.fe.up.pt
>>>> HPING ava01.grid.fe.up.pt (em1 192.168.147.1): S set, 40 headers + 0
>>>> data bytes
>>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=0
>>>> win=14600 rtt=1.5 ms
>>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=1
>>>> win=14600 rtt=0.8 ms
>>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=2
>>>> win=14600 rtt=0.6 ms
>>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=3
>>>> win=14600 rtt=1.0 ms
>>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=4
>>>> win=14600 rtt=1.2 ms
>>>>
>>>> (SA means it's open)
>>>>
>>>> HTH
>>>> --
>>>> Jonathan Barber <jonathan.barber at gmail.com>
>>>>
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Ken Nielson
>> +1 801.717.3700 office +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>> www.adaptivecomputing.com
>>
>>
>>
>>
>
>
>


-- 
Jonathan Barber <jonathan.barber at gmail.com>