[torqueusers] Torque behaving badly

Jonathan Barber jonathan.barber at gmail.com
Fri Nov 15 05:05:56 MST 2013


On 15 November 2013 03:18, Jagga Soorma <jagga13 at gmail.com> wrote:

> I changed the log level and here is what I see on the server:
>
> Looks like it is intermittently having issues connecting to port 15002 on
> the client.  This client was just fine under the 2.5.9 torque production
> environment that we have but seems to be intermittently having issues in
> the 2.5.13 test environment that is setup with gpu support.
>
> [snip]

>
> 11/14/2013 19:15:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
> setting job 7352.server1.xxx.com state from QUEUED-QUEUED to
> RUNNING-PRERUN (4-40)
> 11/14/2013 19:15:20;0008;PBS_Server;Job;7352.server1.xxx.com;forking in
> send_job
>
> *11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;attempting connect to
> host 72.34.135.64 port 15002
> 11/14/2013 19:15:20;0004;PBS_Server;Svr;svr_connect;cannot connect to host
> port 15002 - cannot establish connection () - time=0 seconds*
>
> *11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;attempting connect to
> host 72.34.135.64 port 15002
> 11/14/2013 19:15:22;0004;PBS_Server;Svr;svr_connect;cannot connect to host
> port 15002 - cannot establish connection () - time=0 seconds*
> 11/14/2013 19:15:22;0008;PBS_Server;Job;7352.server1.xxx.com;entering
> post_sendmom
>

You might be running up against limits on the number of file descriptors
the pbs_server process or the OS is allowed to have open. You can use tools
such as lsof to see how many files the pbs_server has open:
$ sudo lsof -c pbs_server
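
To compare the descriptor count against the limit directly, something like the following may help. It is only a sketch and assumes a Linux host with /proc mounted; the function name fd_usage is made up for illustration, and it reads /proc instead of calling lsof:

```shell
#!/bin/sh
# Print the number of open file descriptors and the soft "open files"
# limit for one process. Assumes the Linux /proc filesystem.
fd_usage() {
    pid="$1"
    # Each entry in /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # In /proc/<pid>/limits, field 4 of the "Max open files" row is the
    # soft limit and field 5 is the hard limit.
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    echo "pid=$pid open_fds=$open soft_limit=$limit"
}
```

Run as root (another user's /proc/<pid>/fd is not readable otherwise), for example: fd_usage "$(pgrep -o pbs_server)". If open_fds is close to soft_limit, the descriptor limit is a likely suspect.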

It's also possible that you're running out of ports to bind to. Running
lsof/netstat and looking to see if there are massive numbers of
connections/files open will reveal this.
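
One way to make "massive numbers" concrete is to count sockets per TCP state. The helper below is a rough sketch: count_tcp_states is a hypothetical name, and it assumes the usual Linux `netstat -tan` column layout, where the state is the sixth field:

```shell
#!/bin/sh
# Summarise TCP connections by state. Expects `netstat -tan` output on
# stdin; header lines are skipped because they do not start with "tcp".
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -rn
}
```

Usage on the server would be: netstat -tan | count_tcp_states. On Linux the local port range can be read with: cat /proc/sys/net/ipv4/ip_local_port_range. A TIME_WAIT or ESTABLISHED count that approaches the size of that range would point towards port exhaustion.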

Although you say there is no firewall configured on the servers, do you
know if there is a firewall between the pbs_server and the nodes?

You can do a simple TCP connect to the mom to see if it's listening:
$ nmap -p 15002 ava01.grid.fe.up.pt -oG -
# Nmap 6.40 scan initiated Fri Nov 15 11:52:17 2013 as: nmap -p 15002 -oG - ava01.grid.fe.up.pt
Host: 192.168.147.1 (ava01.grid.fe.up.pt) Status: Up
Host: 192.168.147.1 (ava01.grid.fe.up.pt) Ports: 15002/open/tcp//unknown///
# Nmap done at Fri Nov 15 11:52:17 2013 -- 1 IP address (1 host up) scanned in 0.04 seconds
$

Or continuously with hping3 (I'm sure there are other tools that will do
this as well):
$ sudo hping3 -S -p 15002 ava01.grid.fe.up.pt
HPING ava01.grid.fe.up.pt (em1 192.168.147.1): S set, 40 headers + 0 data bytes
len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=0 win=14600 rtt=1.5 ms
len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=1 win=14600 rtt=0.8 ms
len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=2 win=14600 rtt=0.6 ms
len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=3 win=14600 rtt=1.0 ms
len=46 ip=192.168.147.1 ttl=61 DF id=0 sport=15002 flags=SA seq=4 win=14600 rtt=1.2 ms

(flags=SA is a SYN/ACK reply, which means the port is open. A closed port
would answer with flags=RA, i.e. RST/ACK.)
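
Since the failures are intermittent, it may be worth leaving a probe running on the server and logging only the failed attempts. The sketch below uses a plain TCP connect through bash's /dev/tcp instead of nmap or hping3, so it needs bash and the coreutils timeout command but no root. The names probe_once and watch_port are invented for this example:

```shell
#!/bin/bash
# Try one TCP connect to host $1, port $2, giving up after ${3:-2} seconds.
# Exit status 0 means the three-way handshake completed.
probe_once() {
    # bash opens the socket via its /dev/tcp pseudo-device; timeout(1)
    # bounds the attempt so a filtered port cannot hang the loop.
    timeout "${3:-2}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Probe repeatedly (every ${3:-1} seconds) and print a timestamped line
# for each failed attempt. Runs until interrupted.
watch_port() {
    host="$1"; port="$2"; interval="${3:-1}"
    while :; do
        probe_once "$host" "$port" ||
            echo "$(date '+%m/%d/%Y %H:%M:%S') connect to $host:$port failed"
        sleep "$interval"
    done
}
```

For example: watch_port 72.34.135.64 15002. The timestamp format is chosen to match the pbs_server log, so any failures can be lined up against the svr_connect messages.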

HTH
-- 
Jonathan Barber <jonathan.barber at gmail.com>