[torqueusers] pbs_mom segfault in TMomCheckJobChild
Joshua Bernstein
jbernstein at penguincomputing.com
Wed Dec 10 02:01:41 MST 2008
Hello Torque World!
I'm working on a small cluster with three compute nodes at a large
customer site. Each node is an x86_64 Intel quad-core machine with 8GB
of RAM. The system is running CentOS 4 Update 6. Originally I was
running TORQUE version 2.3.3, but later updated to 2.3.5. The
workload consists of many hundreds of jobs in the cluster, each of
which is very short-lived and runs for about 2 seconds.
While investigating some error messages in the logs, I began poking
around the code and found this little section in
src/lib/Libnet/net_server.c. Toward the end of wait_request() it reads:
---SNIP---
...
else
{
FD_CLR(i, &readset);
close(i);
num_connections--; /* added by CRI - should this be here? */
...
---END SNIP---
This of course doesn't make sense. Instead, that close() call should
be a call to close_conn(), which removes the need to decrement
num_connections by hand, so the code should read:
---SNIP---
...
else
{
FD_CLR(i, &readset);
close_conn(i);
...
---END SNIP---
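To illustrate why routing the close through a single helper is safer than a bare close() followed by a hand-written decrement, here is a minimal sketch. It is not TORQUE's code: conn_active, add_conn_sketch() and close_conn_sketch() are hypothetical names made up for this example, and the real close_conn() may well do more than this.

```c
#include <assert.h>
#include <unistd.h>

#define MAX_CONN 1024

/* Hypothetical connection table; NOT TORQUE's actual data structure. */
static int conn_active[MAX_CONN];
static int num_connections = 0;

/* Record a descriptor as an active connection. Returns 0 on success. */
static int add_conn_sketch(int fd)
{
    if (fd < 0 || fd >= MAX_CONN || conn_active[fd])
        return -1;
    conn_active[fd] = 1;
    num_connections++;
    return 0;
}

/* Close a descriptor and update the bookkeeping in one place.
 * A second call on the same descriptor is a no-op, so the counter
 * cannot be decremented twice for one connection. */
static void close_conn_sketch(int fd)
{
    if (fd < 0 || fd >= MAX_CONN || !conn_active[fd])
        return;
    assert(num_connections > 0);
    close(fd);
    conn_active[fd] = 0;
    num_connections--;
}
```

With the close and the counter update kept together, no caller has to remember the decrement, and a stray second close of the same descriptor cannot drive the counter out of step with the table.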
So I made the change and rebuilt the code, only to find that after
running jobs nicely for a bit, pbs_mom segfaults! I figured perhaps
there were other places where this mistake had been made, and on a
hunch downloaded the CVS snapshot from Monday
(torque-2.4.0-snap.200806031058.tar.gz). It turns out that in the CVS
snapshot this correction had already been made! So I rebuilt that
snapshot with debugging symbols enabled (--with-debug), and once
again, after running around 100 jobs quite nicely, pbs_mom segfaults
and reports this in the syslog:
Dec 9 18:21:06 10.1.1.101 n1 pbs_mom[8638]: segfault at 0000000000000001 rip 000000000042addf rsp 0000007fbfffdab0 error 6
Dec 9 18:22:23 10.1.1.100 n0 pbs_mom[5818]: segfault at 0000000000000001 rip 000000000042addf rsp 0000007fbfffdab0 error 6
Dec 9 18:23:38 10.1.1.102 n2 pbs_mom[8739]: segfault at 0000000000000002 rip 000000000042addf rsp 0000007fbfffdab0 error 6
I've also seen the error number at the end change; sometimes it's a
4. I'm not sure what the difference means.
Dec 9 17:12:10 10.1.1.102 n2 pbs_mom[18801]: segfault at 00000000015f32a8 rip 000000000041695f rsp 0000007fbffffb80 error 4
Dec 9 17:15:19 10.1.1.101 n1 pbs_mom[18670]: segfault at 0000000000000100 rip 000000000041688e rsp 0000007fbffffb80 error 4
Dec 9 17:21:54 10.1.1.100 n0 pbs_mom[15867]: segfault at 0000000000000100 rip 000000000041688e rsp 0000007fbffffb80 error 4
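A note on that trailing number: on x86 the kernel prints the page-fault error code there, which is a bitmask. Bit 0 is set for a protection violation and clear when the page was not present, bit 1 is set for a write and clear for a read, and bit 2 is set when the fault came from user mode. The sketch below decodes only those three low bits; the struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Low three bits of the x86 page-fault error code. */
#define PF_PROT  0x1UL  /* set: protection violation, clear: page not present */
#define PF_WRITE 0x2UL  /* set: write access, clear: read access */
#define PF_USER  0x4UL  /* set: fault raised in user mode */

/* Hypothetical helper type for this example. */
struct pf_info {
    int user_mode;
    int write;
    int protection;
};

/* Split an error code into its three low flag bits. */
static struct pf_info decode_pf_error(unsigned long code)
{
    struct pf_info info;
    info.user_mode  = (code & PF_USER)  != 0;
    info.write      = (code & PF_WRITE) != 0;
    info.protection = (code & PF_PROT)  != 0;
    return info;
}

/* Print a one-line description of the error code to the given stream. */
static void print_pf_error(FILE *out, unsigned long code)
{
    struct pf_info info = decode_pf_error(code);
    assert(out != NULL);
    fprintf(out, "error %lu: %s-mode %s, %s\n", code,
            info.user_mode ? "user" : "kernel",
            info.write ? "write" : "read",
            info.protection ? "protection violation" : "page not present");
}
```

Read this way, "error 6" is a user-mode write to an unmapped page, which fits the fault addresses 0x1 and 0x2 just above NULL, and "error 4" is a user-mode read of an unmapped page.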
After making some changes to our node startup scripts, I was able to
produce a core file (roughly 103 MB) and get a backtrace, which shows
the segfault happening in TMomCheckJobChild():
[root@solid0010 ~]# gdb /usr/sbin/pbs_mom core.2
...
Core was generated by `/usr/sbin/pbs_mom -c /var/spool/torque/mom_priv/config -l /var/spool/torque/mom'.
....
(gdb) bt
#0 0x000000000042addf in TMomCheckJobChild ()
#1 0x000000000042990d in start_exec ()
#2 0x000000000042ea5a in req_commit ()
#3 0x0000000000431233 in dispatch_request ()
#4 0x000000000043115e in process_request ()
#5 0x0000002a955b0cd5 in wait_request (waittime=1, SState=0x0) at ../Libnet/net_server.c:475
#6 0x000000000041b00e in main_loop ()
#7 0x000000000041b1f8 in main ()
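For anyone trying to reproduce the core file: a daemon started from an init script often inherits a core-size limit of zero, so no core is written when it crashes. The sketch below assumes that limit was the obstacle here (the exact startup-script changes are not shown in this message) and raises the soft limit to the hard limit from inside a process. enable_core_dumps() is a hypothetical helper name; getrlimit() and setrlimit() are the standard POSIX calls.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* Sanity check: the soft limit never exceeds the hard limit. */
    assert(rl.rlim_cur <= rl.rlim_max);

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

If the hard limit itself is zero, this call succeeds but changes nothing, and the limit has to be raised by whatever starts the daemon instead.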
I would appreciate any help in continuing to debug the matter as I
really don't know where to go next. Without divulging too much, I can
say that time is of the utmost importance in making some progress
here, so any suggestions (short of switching resource managers :-) ),
would be greatly appreciated. Feel free to contact me directly if
anybody is interested in a copy of the binary and core files!
-Joshua Bernstein
Software Engineer
Penguin Computing