[torqueusers] Torque behaving badly

Jagga Soorma jagga13 at gmail.com
Mon Nov 18 19:06:05 MST 2013


I was able to resolve my intermittent connection issues by setting the
following kernel tunables on the client:

sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_tw_recycle=1

However, there is only one server and one client in this Torque test
environment, so I still don't understand why fast recycling of sockets in
the TIME_WAIT state would help or be needed in this case.  I might just be
masking the real problem.
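
(As far as I understand, tcp_tw_recycle only has an effect when TCP
timestamps are also enabled, which would explain needing both settings.)
A quick way to check whether TIME_WAIT sockets are actually piling up on
the mom port would be something like the following -- port 15002 is just
what my moms listen on, so adjust as needed:

# count sockets in TIME_WAIT that involve the mom port
$ ss -tan state time-wait '( sport = :15002 or dport = :15002 )' | wc -l

# or count all TIME_WAIT sockets with netstat
$ netstat -ant | awk '$6 == "TIME_WAIT"' | wc -l

If the tunables do turn out to be necessary, they can be made persistent by
adding the same two settings to /etc/sysctl.conf and reloading with
"sysctl -p".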

Has anyone run into this issue before?

Thanks,
-J


On Mon, Nov 18, 2013 at 3:50 PM, Ken Nielson <knielson at adaptivecomputing.com> wrote:

> The "cannot connect" message looks suspiciously like it could be a
> firewall problem.
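>
> A quick sanity check on both hosts (assuming iptables is the firewall in
> play; adjust for whatever your distro uses) would be to look for rules
> that touch the mom port:
>
> $ sudo iptables -L -n -v | grep -E '15002|DROP|REJECT'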
>
> Regards
>
>
> On Fri, Nov 15, 2013 at 1:11 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> So, this is a brand new install of Torque with nothing running on the
>> server/client except the Torque processes.  I checked, and I don't think the
>> server is running into any process limits.
>>
>> I set up the server & sched processes on the client itself and am now
>> running everything on the client host to rule out external components.  I
>> still see the same problem connecting to port 15002.  This server had a
>> 1Gig copper connection as well, and I migrated my network to a completely
>> different NIC, but that did not help either.
>>
>> This is a really bizarre one, and I can't seem to find the cause.  Is there
>> anything else you guys can think of that might help me troubleshoot this
>> problem?
>>
>> Thanks,
>> -J
>>
>>
>> On Fri, Nov 15, 2013 at 4:05 AM, Jonathan Barber <jonathan.barber at gmail.com> wrote:
>>
>>> On 15 November 2013 03:18, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>
>>>> I changed the log level and here is what I see on the server:
>>>>
>>>> It looks like the server is intermittently failing to connect to port 15002
>>>> on the client.  This client was just fine under the 2.5.9 Torque production
>>>> environment that we have, but it seems to be intermittently having issues in
>>>> the 2.5.13 test environment that is set up with GPU support.
>>>>
>>>> [snip]
>>>
>>>>
>>>> 11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 7352.server1.xxx.com state from QUEUED-QUEUED to RUNNING-PRERUN (4-40)
>>>> 11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;forking in send_job
>>>>
>>>> 11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;attempting connect to host 72.34.135.64 port 15002
>>>> 11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection () - time=0 seconds
>>>>
>>>> 11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;attempting connect to host 72.34.135.64 port 15002
>>>> 11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;cannot connect to host port 15002 - cannot establish connection () - time=0 seconds
>>>> 11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;entering post_sendmom
>>>>
>>>
>>> You might be running up against limits on the number of file descriptors
>>> the pbs_server process or the OS is allowed to have open. You can use tools
>>> such as lsof to see how many files the pbs_server has open:
>>> $ sudo lsof -c pbs_server
>>>
>>> It's also possible that you're running out of ports to bind to.  Running
>>> lsof or netstat and checking for very large numbers of open
>>> connections/files will reveal this.
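>>>
>>> For example (just a sketch, using the pbs_server name and the mom port
>>> from your logs), something along these lines shows the descriptor limit
>>> of the running pbs_server, how many descriptors it currently holds, and
>>> the state of connections on the mom port:
>>>
>>> # per-process file descriptor limit
>>> $ grep 'open files' /proc/$(pgrep -o -x pbs_server)/limits
>>>
>>> # number of descriptors pbs_server currently has open
>>> $ sudo lsof -c pbs_server | wc -l
>>>
>>> # connection states involving the mom port
>>> $ netstat -ant | grep ':15002' | awk '{print $6}' | sort | uniq -c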
>>>
>>> Although you say there is no firewall configured on the servers, do you
>>> know if there is a firewall between the pbs_server and the nodes?
>>>
>>> You can do a simple TCP connect to the mom to see if it's listening:
>>> $ nmap -p 15002 ava01.grid.fe.up.pt -oG -
>>> # Nmap 6.40 scan initiated Fri Nov 15 11:52:17 2013 as: nmap -p 15002 -oG - ava01.grid.fe.up.pt
>>> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Status: Up
>>> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Ports: 15002/open/tcp//unknown///
>>> # Nmap done at Fri Nov 15 11:52:17 2013 -- 1 IP address (1 host up) scanned in 0.04 seconds
>>> $
>>>
>>> Or continuously with hping3 (I'm sure there are other tools that will do
>>> this as well):
>>> $ sudo hping3 -S -p 15002 ava01.grid.fe.up.pt
>>> HPING ava01.grid.fe.up.pt (em1 192.168.147.1): S set, 40 headers + 0 data bytes
>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=0 win=14600 rtt=1.5 ms
>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=1 win=14600 rtt=0.8 ms
>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=2 win=14600 rtt=0.6 ms
>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=3 win=14600 rtt=1.0 ms
>>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=4 win=14600 rtt=1.2 ms
>>>
>>> (SA means it's open)
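>>>
>>> (If hping3 isn't available, a plain nc loop gives a rough equivalent --
>>> hostname and port as above:)
>>> $ while true; do nc -z -v -w 2 ava01.grid.fe.up.pt 15002; sleep 1; done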
>>>
>>> HTH
>>> --
>>> Jonathan Barber <jonathan.barber at gmail.com>
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
>
>
>
>