[torqueusers] pbs_mom segfault in TMomCheckJobChild

Joshua Bernstein jbernstein at penguincomputing.com
Wed Dec 10 02:01:41 MST 2008


Hello Torque World!

	I'm working on a small cluster with three compute nodes at a large
customer site. Each node has x86_64 Intel quad-core processors and 8GB
of RAM. The system is running CentOS 4 Update 6. Originally I was
running TORQUE version 2.3.3, but later updated to 2.3.5. The
workload consists of many hundreds of jobs in the cluster at once, and
each job is very short lived, running for about 2 seconds.

	While investigating some error messages in the logs, I began poking
around the code and found this little section in
src/lib/Libnet/net_server.c.

Towards the end of wait_request(), it reads:

---SNIP---
...
       else
         {
         FD_CLR(i, &readset);
         close(i);
         num_connections--;  /* added by CRI - should this be here? */
...
---END SNIP---

This doesn't make sense to me. Instead of calling close() directly, the
code should call close_conn(), which removes the need to decrement
num_connections by hand, and thus read:

---SNIP---
...
       else
         {
         FD_CLR(i, &readset);
         close_conn(i);
...
---END SNIP---
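
To illustrate why routing the close through close_conn() removes the need for a separate decrement, here is a minimal sketch of the pattern. It is not the TORQUE source: conn_table, add_conn(), MAX_CONN, and the on_close hook are hypothetical names made up for this example, and the real close_conn() in net_server.c may do more or different bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_CONN 1024            /* hypothetical table size */

enum conn_state { CONN_IDLE = 0, CONN_ACTIVE };

struct conn {
    enum conn_state state;       /* is this descriptor being tracked? */
    void (*on_close)(int fd);    /* optional per-connection cleanup hook */
};

static struct conn conn_table[MAX_CONN];   /* indexed by file descriptor */
static int num_connections = 0;

/* Start tracking an already-open descriptor. */
static void add_conn(int fd, void (*on_close)(int fd))
{
    assert(fd >= 0 && fd < MAX_CONN);
    assert(conn_table[fd].state == CONN_IDLE);
    conn_table[fd].state = CONN_ACTIVE;
    conn_table[fd].on_close = on_close;
    num_connections++;
}

/* Close a tracked descriptor and update all bookkeeping in one place. */
static void close_conn(int fd)
{
    if (fd < 0 || fd >= MAX_CONN)
        return;
    if (conn_table[fd].state == CONN_IDLE)
        return;                  /* not tracked: nothing to undo */

    close(fd);
    if (conn_table[fd].on_close != NULL)
        conn_table[fd].on_close(fd);

    conn_table[fd].state = CONN_IDLE;
    conn_table[fd].on_close = NULL;
    num_connections--;           /* the only place the counter goes down */
}
```

In this sketch, a caller that does close(fd) followed by num_connections-- leaves the table entry marked active and skips the cleanup hook, so the counter and the table can drift apart. Calling close_conn(fd) keeps them consistent, and a second call on the same descriptor is a no-op.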

So I made the change and rebuilt the code, only to find that after
running jobs nicely for a while, pbs_mom segfaults! I figured perhaps
this mistake had been made in other places too, and on a hunch
downloaded the CVS snapshot from Monday
(torque-2.4.0-snap.200806031058.tar.gz). It turns out that in the CVS
snapshot this correction had already been made! So I rebuilt that
snapshot with debugging symbols enabled (--with-debug), and once again,
after running several jobs (say around 100) quite nicely, pbs_mom
segfaults and reports this in the syslog:

Dec  9 18:21:06 10.1.1.101 n1 pbs_mom[8638]: segfault at 0000000000000001 rip 000000000042addf rsp 0000007fbfffdab0 error 6
Dec  9 18:22:23 10.1.1.100 n0 pbs_mom[5818]: segfault at 0000000000000001 rip 000000000042addf rsp 0000007fbfffdab0 error 6
Dec  9 18:23:38 10.1.1.102 n2 pbs_mom[8739]: segfault at 0000000000000002 rip 000000000042addf rsp 0000007fbfffdab0 error 6

I've also seen the error number at the end change; sometimes it's a
4. I'm not sure what the difference is.

Dec  9 17:12:10 10.1.1.102 n2 pbs_mom[18801]: segfault at 00000000015f32a8 rip 000000000041695f rsp 0000007fbffffb80 error 4
Dec  9 17:15:19 10.1.1.101 n1 pbs_mom[18670]: segfault at 0000000000000100 rip 000000000041688e rsp 0000007fbffffb80 error 4
Dec  9 17:21:54 10.1.1.100 n0 pbs_mom[15867]: segfault at 0000000000000100 rip 000000000041688e rsp 0000007fbffffb80 error 4
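
On the error number: assuming the kernel on these nodes prints the raw x86 page-fault error code in this message (which is how I understand Linux segfault lines of this form), the low three bits can be decoded as in the sketch below. The struct and function names are made up for illustration. Under that assumption, "error 6" would be a user-mode write to an unmapped page and "error 4" a user-mode read of an unmapped page.

```c
#include <assert.h>
#include <stdio.h>

/* Low three bits of the x86 page-fault error code. */
struct pf_info {
    int protection;  /* bit 0: 1 = protection violation, 0 = page not present */
    int write;       /* bit 1: 1 = write access, 0 = read access */
    int user;        /* bit 2: 1 = fault raised in user mode, 0 = kernel mode */
};

static struct pf_info decode_pf_error(unsigned long err)
{
    struct pf_info info;
    info.protection = (err & 0x1UL) != 0;
    info.write      = (err & 0x2UL) != 0;
    info.user       = (err & 0x4UL) != 0;
    return info;
}

/* Print a one-line, human-readable description of an error code. */
static void print_pf_error(unsigned long err)
{
    struct pf_info info = decode_pf_error(err);
    printf("error %lu: %s-mode %s, %s\n",
           err,
           info.user ? "user" : "kernel",
           info.write ? "write" : "read",
           info.protection ? "protection violation" : "page not present");
}

/* Sanity checks for the two codes that appear in the syslog excerpts. */
static void check_logged_codes(void)
{
    struct pf_info six = decode_pf_error(6);   /* binary 110 */
    struct pf_info four = decode_pf_error(4);  /* binary 100 */

    assert(six.user && six.write && !six.protection);
    assert(four.user && !four.write && !four.protection);
}
```

Calling print_pf_error(6) and print_pf_error(4) gives a readable summary of the two cases. Read this way, the faults at addresses 1 and 2 are writes and the faults at 0x100 and 0x15f32a8 are reads; they also occur at different rip values, so they may not be the same bug.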

After making some changes to our node startup scripts, I was able to
generate a core file (roughly 103 MB) and get a backtrace, which shows
the segfault happening in TMomCheckJobChild():

[root at solid0010 ~]# gdb /usr/sbin/pbs_mom core.2
...
Core was generated by `/usr/sbin/pbs_mom -c /var/spool/torque/mom_priv/config -l /var/spool/torque/mom'.
....
(gdb) bt
#0  0x000000000042addf in TMomCheckJobChild ()
#1  0x000000000042990d in start_exec ()
#2  0x000000000042ea5a in req_commit ()
#3  0x0000000000431233 in dispatch_request ()
#4  0x000000000043115e in process_request ()
#5  0x0000002a955b0cd5 in wait_request (waittime=1, SState=0x0) at ../Libnet/net_server.c:475
#6  0x000000000041b00e in main_loop ()
#7  0x000000000041b1f8 in main ()

I would appreciate any help in continuing to debug the matter as I  
really don't know where to go next. Without divulging too much, I can  
say that time is of the utmost importance in making some progress  
here, so any suggestions (short of switching resource managers :-) ),  
would be greatly appreciated. Feel free to contact me directly if  
anybody is interested in a copy of the binary and core files!

-Joshua Bernstein
Software Engineer
Penguin Computing











