[torqueusers] Torque 2.1.x pbs_server process hogging cpu

Martin Schafföner martin.schaffoener at e-technik.uni-magdeburg.de
Wed Jun 14 01:59:25 MDT 2006


On Wednesday 14 June 2006 00:58, garrick at speculation.org wrote:

> Yes, you are right.  That is an infinite loop.  But why is connect()
> failing with EADDRNOTAVAIL?   "The specified address is not available on
> the remote machine."  I don't know what that means.  Why would it
> attempt a connection to any machine other than one with the specified
> IP?
>
> Something wonky in your route table?  Are you routing a network to
> yourself without actually configuring the IP to your interface?

Hm, never debugged routing tables before, but a quick look reveals nothing 
spectacular:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.100.0   *               255.255.255.0   U     0      0        0 eth1
FET-IPE-xx      *               255.255.255.0   U     0      0        0 eth0
192.168.1.0     *               255.255.255.0   U     0      0        0 bond0
link-local      *               255.255.0.0     U     0      0        0 eth0
loopback        *               255.0.0.0       U     0      0        0 lo
default         xxx.xx.xx.xxx   0.0.0.0         UG    0      0        0 eth0

eth0 is for outbound connections, bond0 is inbound towards nodes and eth1 is 
inbound for misc stuff (ups, myrinet monitoring, etc.)
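
As a cross-check against the routing table: the hostaddr values that gdb 
prints in the log further down are host-order integers (client_to_svr passes 
them through htonl() before connecting). A small sketch for decoding them; 
hostaddr_to_text is a hypothetical helper of mine, not part of TORQUE:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper, not TORQUE code: format a host-byte-order IPv4
 * address (the form client_to_svr receives) as dotted-quad text.
 * Returns buf on success, NULL on failure (for example, buffer too small). */
static const char *hostaddr_to_text(uint32_t hostaddr, char *buf, size_t len)
{
  struct in_addr a;

  assert(buf != NULL);

  memset(&a, 0, sizeof(a));
  a.s_addr = htonl(hostaddr);   /* same conversion as in net_client.c */

  return inet_ntop(AF_INET, &a, buf, (socklen_t)len);
}
```

With a buffer of INET_ADDRSTRLEN bytes, hostaddr_to_text(3232235784u, ...) 
gives 192.168.1.8, which lies in the 192.168.1.0 network on bond0, so the 
mom's address should be directly reachable according to the table.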

> Can you try another build without BIND_RESVPORT?

Did that, and pbs_server behaves as expected. Note that I only changed the 
server side; the mom was always from 2.1.1-snap.

> > (correctly) eliminates the variable. So, shouldn't line 241
> > (local.sin_port = htons(tryport);) be moved to line 226, just above the
> > #ifdef HAVE_BINDRESVPORT?
>
> No, sin_port should be 0 for bindresvport() to work correctly.

Okay, thanks for the lesson.
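
To make sure I understood, here is a minimal sketch of the bindresvport() 
path as I now read it, assuming glibc on Linux. bind_reserved is a 
hypothetical helper name of mine, not a TORQUE function:

```c
#define _DEFAULT_SOURCE        /* exposes bindresvport() in glibc headers */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper, not TORQUE code: bind sock to some reserved port
 * (below 1024) chosen by the library.  Returns 0 on success, or -1 with
 * errno set on failure. */
static int bind_reserved(int sock, struct sockaddr_in *local)
{
  assert(local != NULL);

  memset(local, 0, sizeof(*local));
  local->sin_family      = AF_INET;
  local->sin_addr.s_addr = htonl(INADDR_ANY);
  local->sin_port        = 0;   /* left at 0, as advised above */

  return bindresvport(sock, local);
}
```

Without the privilege to bind reserved ports I would expect this to fail 
(typically with EACCES), so it only makes sense in a daemon running as root.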

So I debugged pbs_server from a HAVE_BINDRESVPORT build while submitting a 
job. Here's the log, with my comments marked ####:

#### Startup
Breakpoint 2, client_to_svr (hostaddr=2368486806, port=15004, local_port=1) at 
net_client.c:177
177       int               one = 1;
(gdb) c
Continuing.
PBS_Server: Connection refused (111) in contact_sched, Could not contact 
Scheduler - port 15004
[Thread debugging using libthread_db enabled]
[New Thread 1075251424 (LWP 15704)]
[Switching to Thread 1075251424 (LWP 15704)]

#### Job submitted; the server obviously tries to contact the scheduler
#### because "set server scheduling=true" is in effect
Breakpoint 2, client_to_svr (hostaddr=2368486806, port=15004, local_port=1) at 
net_client.c:177
177       int               one = 1;
(gdb) c
Continuing.
PBS_Server: Connection refused (111) in contact_sched, Could not contact 
Scheduler - port 15004

#### A little later, the scheduler seems to be asking pbs_server
#### to start the job
Breakpoint 2, client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at 
net_client.c:177
177       int               one = 1;
(gdb) where
#0  client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at 
net_client.c:177
#1  0x0806a7f8 in svr_connect (hostaddr=3232235784, port=15002, func=0x804f060 
<process_Dreply>, cntype=ToServerDIS) at svr_connect.c:175
#2  0x08068c9e in stat_to_mom (pjob=0x80c5208, cntl=0x80c59f8) at 
req_stat.c:496
#3  0x08068d7f in stat_mom_job (pjob=0x1) at req_stat.c:621
#4  0x08066990 in post_sendmom (pwt=0x80c5848) at req_runjob.c:1052
#5  0x0806ecc5 in dispatch_task (ptask=0x80c5848) at svr_task.c:198
#6  0x08058860 in next_task () at pbsd_main.c:1204
#7  0x0805973a in main (argc=3, argv=0xbfffe724) at pbsd_main.c:963
(gdb) c
Continuing.
#### Job starts running

#### A few seconds later, we get into the forever-looping call:
Breakpoint 2, client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at 
net_client.c:177
177       int               one = 1;
(gdb) where
#0  client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at 
net_client.c:177
#1  0x0806a7f8 in svr_connect (hostaddr=3232235784, port=15002, func=0x804f060 
<process_Dreply>, cntype=ToServerDIS) at svr_connect.c:175
#2  0x0804f760 in relay_to_mom (momaddr=1, request=0x80f99c8, func=0x1) at 
issue_request.c:143
#3  0x080614e0 in req_modifyjob (preq=0x80f99c8) at req_modify.c:305
#4  0x0805a87a in process_request (sfds=14) at process_request.c:494
#5  0x400337a0 in wait_request (waittime=1, SState=0x808a2bc) at 
net_server.c:312
#6  0x08059b6d in main (argc=3, argv=0xbfffe724) at pbsd_main.c:1025
(gdb) n
179       local.sin_family = AF_INET;
(gdb)
180       local.sin_addr.s_addr = 0;
(gdb)
181       local.sin_port = 0;
(gdb)
183       tryport = IPPORT_RESERVED - 1;
(gdb)
189       sock = socket(AF_INET,SOCK_STREAM,0);
(gdb)
191       if (sock < 0)
(gdb) print tryport
No symbol "tryport" in current context.
(gdb) n
189       sock = socket(AF_INET,SOCK_STREAM,0);
(gdb)
191       if (sock < 0)
(gdb)
196       if (sock >= PBS_NET_MAX_CONNECTIONS)
(gdb)
205       flags = fcntl(sock,F_GETFL);
(gdb)
207       fcntl(sock,F_SETFL,flags);
(gdb)
206       flags |= O_NONBLOCK;
(gdb)
207       fcntl(sock,F_SETFL,flags);
(gdb)
214       if (local_port != FALSE)
(gdb)
218         setsockopt(
(gdb)
228         if (bindresvport(sock,&local) < 0)
(gdb)
276       remote.sin_family = AF_INET;
(gdb)
274       remote.sin_addr.s_addr = htonl(hostaddr);
(gdb)
275       remote.sin_port = htons((unsigned short)port);
(gdb) print port
$1 = 15002
(gdb) n
278       if (connect(sock,(struct sockaddr *)&remote,sizeof(remote)) >= 0)
(gdb)
294       switch (errno)
(gdb)
299           if (local_port != FALSE)
(gdb)
303             close(sock);
(gdb)
189       sock = socket(AF_INET,SOCK_STREAM,0);
#### We're back in the goto game
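
The retry path in the trace never gives up: each failed connect() closes the 
socket and jumps straight back to socket(). A minimal sketch of a bounded 
alternative, assuming a cap on attempts is acceptable. MAX_CONNECT_TRIES, 
attempt_fn, connect_with_retry and the stub are hypothetical names of mine; 
this is not TORQUE code and not a proposed patch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_CONNECT_TRIES 10   /* hypothetical cap, not a TORQUE constant */

/* One connection attempt (socket + bind + connect in the real code).
 * Returns a socket >= 0, or -1 with errno set. */
typedef int (*attempt_fn)(void *ctx);

/* Retry only on errors that look like a local port clash, and stop after
 * MAX_CONNECT_TRIES attempts instead of looping forever.
 * Returns the socket, or -1 with errno left from the last attempt. */
static int connect_with_retry(attempt_fn attempt, void *ctx, int *tries_out)
{
  int tries = 0;
  int sock  = -1;

  assert(attempt != NULL);

  while (tries < MAX_CONNECT_TRIES)
    {
    tries++;
    sock = attempt(ctx);

    if (sock >= 0)
      break;                     /* connected */

    if ((errno != EADDRINUSE) && (errno != EADDRNOTAVAIL))
      break;                     /* not a port clash: retrying will not help */
    }

  if (tries_out != NULL)
    *tries_out = tries;

  return sock;
}

/* Stub for illustration: counts calls and always fails with
 * EADDRNOTAVAIL, like connect() in the trace. */
static int always_addrnotavail(void *ctx)
{
  int *calls = ctx;

  (*calls)++;
  errno = EADDRNOTAVAIL;

  return -1;
}
```

Driving connect_with_retry with the stub returns -1 after exactly 
MAX_CONNECT_TRIES attempts, so the caller gets an error back instead of 
pbs_server spinning on the cpu.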

Regards,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063

