[torqueusers] Torque 2.1.x pbs_server process hogging cpu
Martin Schafföner
martin.schaffoener at e-technik.uni-magdeburg.de
Wed Jun 14 01:59:25 MDT 2006
On Wednesday 14 June 2006 00:58, garrick at speculation.org wrote:
> Yes, you are right. That is an infinite loop. But why is connect()
> failing with EADDRNOTAVAIL? "The specified address is not available on
> the remote machine." I don't know what that means. Why would it
> attempt a connection to any machine other than one with the specified
> IP?
>
> Something wonky in your route table? Are you routing a network to
> yourself without actually configuring the IP to your interface?
Hm, never debugged routing tables before, but a quick look reveals nothing
spectacular:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.100.0   *               255.255.255.0   U     0      0        0 eth1
FET-IPE-xx      *               255.255.255.0   U     0      0        0 eth0
192.168.1.0     *               255.255.255.0   U     0      0        0 bond0
link-local      *               255.255.0.0     U     0      0        0 eth0
loopback        *               255.0.0.0       U     0      0        0 lo
default         xxx.xx.xx.xxx   0.0.0.0         UG    0      0        0 eth0
eth0 is for outbound connections, bond0 is inbound towards the nodes, and eth1
is inbound for misc stuff (UPS, Myrinet monitoring, etc.).
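As a cross-check against the table above: the hostaddr values that show up in
the gdb session further down are plain host-byte-order integers (the trace
shows client_to_svr() passing hostaddr through htonl()). A small helper can
decode them. This is only a sketch; hostaddr_to_octets() and print_hostaddr()
are names made up here and are not part of TORQUE. With it, 3232235784 decodes
to 192.168.1.8, i.e. an address on the bond0 network.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not TORQUE code): split a host-byte-order IPv4
 * address, as printed by gdb for "hostaddr", into its four octets. */
void hostaddr_to_octets(uint32_t hostaddr, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)((hostaddr >> 24) & 0xff); /* most significant */
    out[1] = (unsigned char)((hostaddr >> 16) & 0xff);
    out[2] = (unsigned char)((hostaddr >> 8) & 0xff);
    out[3] = (unsigned char)(hostaddr & 0xff);         /* least significant */
}

/* Hypothetical helper: print the integer next to its dotted-quad form. */
void print_hostaddr(uint32_t hostaddr)
{
    unsigned char o[4];

    hostaddr_to_octets(hostaddr, o);
    printf("%lu -> %u.%u.%u.%u\n", (unsigned long)hostaddr,
           (unsigned)o[0], (unsigned)o[1], (unsigned)o[2], (unsigned)o[3]);
}
```

Calling print_hostaddr(3232235784u) is enough to confirm which interface the
address belongs to.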
> Can you try another build without BIND_RESVPORT?
Did that, and pbs_server behaves as expected. Note that I only changed the
server side; the mom was always from 2.1.1-snap.
> > (correctly) eliminates the variable. So, shouldn't line 241
> > (local.sin_port = htons(tryport);) be moved to line 226, just above the
> > #ifdef HAVE_BINDRESVPORT?
>
> No, sin_port should be 0 for bindresvport() to work correctly.
Okay, thanks for the lesson.
So I debugged the behavior of pbs_server from a HAVE_BINDRESVPORT build when
submitting a job. Here's the log, with my comments marked ####:
#### Startup
Breakpoint 2, client_to_svr (hostaddr=2368486806, port=15004, local_port=1) at
net_client.c:177
177 int one = 1;
(gdb) c
Continuing.
PBS_Server: Connection refused (111) in contact_sched, Could not contact
Scheduler - port 15004
[Thread debugging using libthread_db enabled]
[New Thread 1075251424 (LWP 15704)]
[Switching to Thread 1075251424 (LWP 15704)]
#### Job submitted, obviously trying to contact scheduler
#### as "set server scheduling=true"
Breakpoint 2, client_to_svr (hostaddr=2368486806, port=15004, local_port=1) at
net_client.c:177
177 int one = 1;
(gdb) c
Continuing.
PBS_Server: Connection refused (111) in contact_sched, Could not contact
Scheduler - port 15004
#### A little later, the scheduler seems to be asking pbs_server
#### to start the job
Breakpoint 2, client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at
net_client.c:177
177 int one = 1;
(gdb) where
#0 client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at
net_client.c:177
#1 0x0806a7f8 in svr_connect (hostaddr=3232235784, port=15002, func=0x804f060
<process_Dreply>, cntype=ToServerDIS) at svr_connect.c:175
#2 0x08068c9e in stat_to_mom (pjob=0x80c5208, cntl=0x80c59f8) at
req_stat.c:496
#3 0x08068d7f in stat_mom_job (pjob=0x1) at req_stat.c:621
#4 0x08066990 in post_sendmom (pwt=0x80c5848) at req_runjob.c:1052
#5 0x0806ecc5 in dispatch_task (ptask=0x80c5848) at svr_task.c:198
#6 0x08058860 in next_task () at pbsd_main.c:1204
#7 0x0805973a in main (argc=3, argv=0xbfffe724) at pbsd_main.c:963
(gdb) c
Continuing.
#### Job starts running
#### A few seconds later, we get into the forever-looping call:
Breakpoint 2, client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at
net_client.c:177
177 int one = 1;
(gdb) where
#0 client_to_svr (hostaddr=3232235784, port=15002, local_port=1) at
net_client.c:177
#1 0x0806a7f8 in svr_connect (hostaddr=3232235784, port=15002, func=0x804f060
<process_Dreply>, cntype=ToServerDIS) at svr_connect.c:175
#2 0x0804f760 in relay_to_mom (momaddr=1, request=0x80f99c8, func=0x1) at
issue_request.c:143
#3 0x080614e0 in req_modifyjob (preq=0x80f99c8) at req_modify.c:305
#4 0x0805a87a in process_request (sfds=14) at process_request.c:494
#5 0x400337a0 in wait_request (waittime=1, SState=0x808a2bc) at
net_server.c:312
#6 0x08059b6d in main (argc=3, argv=0xbfffe724) at pbsd_main.c:1025
(gdb) n
179 local.sin_family = AF_INET;
(gdb)
180 local.sin_addr.s_addr = 0;
(gdb)
181 local.sin_port = 0;
(gdb)
183 tryport = IPPORT_RESERVED - 1;
(gdb)
189 sock = socket(AF_INET,SOCK_STREAM,0);
(gdb)
191 if (sock < 0)
(gdb) print tryport
No symbol "tryport" in current context.
(gdb) n
189 sock = socket(AF_INET,SOCK_STREAM,0);
(gdb)
191 if (sock < 0)
(gdb)
196 if (sock >= PBS_NET_MAX_CONNECTIONS)
(gdb)
205 flags = fcntl(sock,F_GETFL);
(gdb)
207 fcntl(sock,F_SETFL,flags);
(gdb)
206 flags |= O_NONBLOCK;
(gdb)
207 fcntl(sock,F_SETFL,flags);
(gdb)
214 if (local_port != FALSE)
(gdb)
218 setsockopt(
(gdb)
228 if (bindresvport(sock,&local) < 0)
(gdb)
276 remote.sin_family = AF_INET;
(gdb)
274 remote.sin_addr.s_addr = htonl(hostaddr);
(gdb)
275 remote.sin_port = htons((unsigned short)port);
(gdb) print port
$1 = 15002
(gdb) n
278 if (connect(sock,(struct sockaddr *)&remote,sizeof(remote)) >= 0)
(gdb)
294 switch (errno)
(gdb)
299 if (local_port != FALSE)
(gdb)
303 close(sock);
(gdb)
189 sock = socket(AF_INET,SOCK_STREAM,0);
#### We're back in the goto game
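To make the problem concrete: the trace shows connect() failing, the errno
switch being entered, the socket being closed, and control jumping straight
back to socket() with nothing limiting the number of rounds. Below is a rough
sketch of how such a loop could be bounded. It is not the TORQUE code and not a
proposed patch: should_retry(), connect_bounded() and MAX_CONNECT_ATTEMPTS are
invented for illustration, treating EADDRINUSE like EADDRNOTAVAIL is an
assumption on my part, and the reserved-port bind step is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical cap on attempts; not a TORQUE constant. */
#define MAX_CONNECT_ATTEMPTS 8

/* Return 1 if the connect() failure "err" on 0-based attempt number
 * "attempt" is worth another try, 0 otherwise. */
int should_retry(int err, int attempt, int max_attempts)
{
    assert(max_attempts > 0);
    if (attempt + 1 >= max_attempts)
        return 0;                     /* budget used up: give up */
    return err == EADDRINUSE || err == EADDRNOTAVAIL;
}

/* Connect to hostaddr:port (both in host byte order).
 * Returns a connected socket, or -1 with errno set by the last failure. */
int connect_bounded(uint32_t hostaddr, unsigned int port)
{
    struct sockaddr_in remote;
    int attempt;

    memset(&remote, 0, sizeof(remote));
    remote.sin_family = AF_INET;
    remote.sin_addr.s_addr = htonl(hostaddr);
    remote.sin_port = htons((unsigned short)port);

    for (attempt = 0; ; attempt++) {
        int err;
        int sock = socket(AF_INET, SOCK_STREAM, 0);

        if (sock < 0)
            return -1;
        if (connect(sock, (struct sockaddr *)&remote, sizeof(remote)) == 0)
            return sock;

        err = errno;                  /* save before close() can change it */
        close(sock);
        if (!should_retry(err, attempt, MAX_CONNECT_ATTEMPTS)) {
            errno = err;
            return -1;
        }
    }
}
```

With a cap like this, a persistent EADDRNOTAVAIL would surface as an error
to the caller instead of pinning the CPU.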
Regards,
--
Martin Schafföner
Cognitive Systems Group, Institute of Electronics, Signal Processing and
Communication Technologies, Department of Electrical Engineering,
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063