[torqueusers] Torque behaving badly

Ken Nielson knielson at adaptivecomputing.com
Mon Nov 18 16:50:13 MST 2013


The "cannot connect" message looks suspiciously like it could be a firewall
problem.
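
One way to test that theory independently of Torque is to poll the MOM
port with plain TCP connects and see whether the failures are refusals or
timeouts. The following is only a rough sketch in Python, not anything
shipped with Torque: the function names probe() and watch() are made up
for illustration, and 15002 is simply the port from the log below. As a
rule of thumb (not a guarantee), a timeout tends to point at packets being
silently dropped, while an immediate "connection refused" tends to mean
nothing is listening or something is actively rejecting the connection.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Attempt a single TCP connect and return (ok, detail)."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer within the timeout: packets may be dropped on the way.
        return False, "timed out"
    except OSError as exc:
        # Covers refused connections, unreachable hosts, DNS failures, etc.
        return False, f"{type(exc).__name__}: {exc}"


def watch(host, port=15002, attempts=10, interval=1.0):
    """Probe repeatedly, print each result, and return the failures."""
    failures = []
    for _ in range(attempts):
        ok, detail = probe(host, port)
        stamp = time.strftime("%m/%d/%Y %H:%M:%S")
        status = "ok" if ok else "FAILED"
        print(f"{stamp} {host}:{port} {status} ({detail})")
        if not ok:
            failures.append((stamp, detail))
        time.sleep(interval)
    return failures
```

Running something like watch("node01", attempts=600) from the pbs_server
host (where "node01" is a placeholder for the compute node's name) and
comparing the failure timestamps with the svr_connect entries in the
server log should show whether the two line up.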

Regards


On Fri, Nov 15, 2013 at 1:11 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> So, this is a brand new install of torque without anything running on the
> server/client except the torque processes.  I checked and I don't think the
> server is running into any process limits.
>
> I set up the server & sched processes on the client itself and am now
> running everything on the client host to rule out external components.  I
> still see the same problem connecting to port 15002.  I also had a 1Gig
> copper connection on this server and migrated my network to a completely
> different NIC, and that did not help either.
>
> This is really a bizarre one that I can't seem to find the cause for.  Any
> other things you guys think might help me troubleshoot this problem?
>
> Thanks,
> -J
>
>
> On Fri, Nov 15, 2013 at 4:05 AM, Jonathan Barber <
> jonathan.barber at gmail.com> wrote:
>
>> On 15 November 2013 03:18, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> I changed the log level and here is what I see on the server:
>>>
>>> Looks like it is intermittently having issues connecting to port 15002
>>> on the client.  This client was just fine under the 2.5.9 torque production
>>> environment that we have but seems to be intermittently having issues in
>>> the 2.5.13 test environment that is set up with GPU support.
>>>
>>> [snip]
>>
>>>
>>> 11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
>>> setting job 7352.server1.xxx.com state from QUEUED-QUEUED to
>>> RUNNING-PRERUN (4-40)
>>> 11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;forking in
>>> send_job
>>>
>>> *11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection () - time=0 seconds*
>>>
>>> *11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;attempting connect
>>> to host 72.34.135.64 port 15002
>>> 11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;cannot connect to
>>> host port 15002 - cannot establish connection () - time=0 seconds*
>>> 11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;entering
>>> post_sendmom
>>>
>>
>> You might be running up against limits on the number of file descriptors
>> the pbs_server process or the OS is allowed to have open. You can use tools
>> such as lsof to see how many files the pbs_server has open:
>> $ sudo lsof -c pbs_server
>>
>> It's also possible that you're running out of ports to bind to. Running
>> lsof/netstat and looking to see if there are massive numbers of
>> connections/files open will reveal this.
>>
>> Although you say there is no firewall configured on the servers, do you
>> know if there is a firewall between the pbs_server and the nodes?
>>
>> You can do a simple TCP connect to the mom to see if it's listening:
>> $ nmap -p 15002 ava01.grid.fe.up.pt -oG -
>> # Nmap 6.40 scan initiated Fri Nov 15 11:52:17 2013 as: nmap -p 15002 -oG
>> - ava01.grid.fe.up.pt
>> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Status: Up
>> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Ports:
>> 15002/open/tcp//unknown///
>> # Nmap done at Fri Nov 15 11:52:17 2013 -- 1 IP address (1 host up)
>> scanned in 0.04 seconds
>> $
>>
>> Or continuously with hping3 (I'm sure there are other tools that will do
>> this as well):
>> $ sudo hping3 -S -p 15002 ava01.grid.fe.up.pt
>> HPING ava01.grid.fe.up.pt (em1 192.168.147.1): S set, 40 headers + 0
>> data bytes
>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=0
>> win=14600 rtt=1.5 ms
>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=1
>> win=14600 rtt=0.8 ms
>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=2
>> win=14600 rtt=0.6 ms
>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=3
>> win=14600 rtt=1.0 ms
>> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=4
>> win=14600 rtt=1.2 ms
>>
>> (flags=SA is a SYN/ACK reply, which means the port is open)
>>
>> HTH
>> --
>> Jonathan Barber <jonathan.barber at gmail.com>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com