[torqueusers] Torque behaving badly

Jagga Soorma jagga13 at gmail.com
Fri Nov 15 13:11:29 MST 2013


So, this is a brand new install of torque without anything running on the
server/client except the torque processes.  I checked and I don't think the
server is running into any process limits.
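
For reference, a check of this kind on Linux could look roughly like the
sketch below. It is not the exact procedure used here. The helper names
parse_max_open_files and fd_usage are made up for illustration, and the
sketch assumes a Linux /proc filesystem and permission to read the
pbs_server entries in it.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because a limit can be 'unlimited'.
    Returns None if the line is not present.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', soft, hard, 'files']
            return fields[3], fields[4]
    return None


def fd_usage(pid):
    """Count open descriptors of a process and read its open-file limits.

    Assumption: Linux only, and the caller may read /proc/<pid>/fd
    (typically root or the owner of the process).
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        limits = parse_max_open_files(fh.read())
    return open_fds, limits
```

Calling fd_usage() with the PID of pbs_server returns the number of open
descriptors and the (soft, hard) limits, so the two can be compared
directly.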

I set up the server & sched processes on the client itself and am now
running everything on the client host to rule out external components.  I
still see the same failure when connecting to port 15002.  This server also
had a 1 Gb copper connection, and I moved my network to a completely
different NIC, but that did not help either.
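
To catch the intermittent failures as they happen, one option is to poll
the port and log every attempt with a timestamp that can be lined up
against the pbs_server log. Below is a minimal Python sketch of that idea,
not something that has been run against this cluster. The function names
probe and watch are hypothetical, and the host and port are whatever the
mom listens on (15002 in this thread).

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection does a full TCP handshake, then we close at once.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # refused, timed out, unreachable, ...
        return False, time.monotonic() - start, str(exc)


def watch(host, port, interval=1.0, count=10):
    """Probe repeatedly, print one timestamped line per attempt.

    Returns the number of failed attempts.
    """
    failures = 0
    for _ in range(count):
        ok, elapsed, err = probe(host, port)
        stamp = time.strftime("%m/%d/%Y %H:%M:%S")
        status = "open" if ok else "FAILED " + err
        print(f"{stamp} {host}:{port} {status} ({elapsed * 1000:.1f} ms)")
        if not ok:
            failures += 1
        time.sleep(interval)
    return failures
```

Running watch() from the server host against the client, with a large
count, should show whether the failed connects cluster around particular
times or are spread evenly.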

This is really a bizarre one that I can't seem to find the cause for.  Any
other things you guys think might help me troubleshoot this problem?

Thanks,
-J


On Fri, Nov 15, 2013 at 4:05 AM, Jonathan Barber
<jonathan.barber at gmail.com> wrote:

> On 15 November 2013 03:18, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> I changed the log level and here is what I see on the server:
>>
>> Looks like it is intermittently having issues connecting to port 15002 on
>> the client.  This client was just fine under the 2.5.9 torque production
>> environment that we have but seems to be intermittently having issues in
>> the 2.5.13 test environment that is setup with gpu support.
>>
>> [snip]
>
>>
>> 11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
>> setting job 7352.server1.xxx.com state from QUEUED-QUEUED to
>> RUNNING-PRERUN (4-40)
>> 11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;forking in
>> send_job
>>
>> *11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;attempting connect
>> to host 72.34.135.64 port 15002
>> 11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;cannot connect to
>> host port 15002 - cannot establish connection () - time=0 seconds*
>>
>> *11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;attempting connect
>> to host 72.34.135.64 port 15002
>> 11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;cannot connect to
>> host port 15002 - cannot establish connection () - time=0 seconds*
>> 11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;entering
>> post_sendmom
>>
>
> You might be running up against limits on the number of file descriptors
> the pbs_server process or the OS is allowed to have open. You can use tools
> such as lsof to see how many files the pbs_server has open:
> $ sudo lsof -c pbs_server
>
> It's also possible that you're running out of ports to bind to. Running
> lsof/netstat and looking to see if there are massive numbers of
> connections/files open will reveal this.
>
> Although you say there is no firewall configured on the servers, do you
> know if there is a firewall between the pbs_server and the nodes?
>
> You can do a simple TCP connect to the mom to see if it's listening:
> $ nmap -p 15002 ava01.grid.fe.up.pt -oG -
> # Nmap 6.40 scan initiated Fri Nov 15 11:52:17 2013 as: nmap -p 15002 -oG
> - ava01.grid.fe.up.pt
> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Status: Up
> Host: 192.168.147.1 (ava01.grid.fe.up.pt) Ports:
> 15002/open/tcp//unknown///
> # Nmap done at Fri Nov 15 11:52:17 2013 -- 1 IP address (1 host up)
> scanned in 0.04 seconds
> $
>
> Or continuously with hping3 (I'm sure there are other tools that will do
> this as well):
> $ sudo hping3 -S -p 15002 ava01.grid.fe.up.pt
> HPING ava01.grid.fe.up.pt (em1 192.168.147.1): S set, 40 headers + 0 data
> bytes
> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=0
> win=14600 rtt=1.5 ms
> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=1
> win=14600 rtt=0.8 ms
> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=2
> win=14600 rtt=0.6 ms
> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=3
> win=14600 rtt=1.0 ms
> len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=4
> win=14600 rtt=1.2 ms
>
> (SA means it's open)
>
> HTH
> --
> Jonathan Barber <jonathan.barber at gmail.com>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>