[torquedev] pbs_mom segfault - 2.0.0p7

Marcus R. Epperson mrepper at sandia.gov
Mon Feb 13 18:38:06 MST 2006


Today we saw a rank 0 pbs_mom segfault on a large job.  Here is the backtrace:

#0  0x000000301132e37d in raise () from /lib64/tls/libc.so.6
#1  0x000000301132faae in abort () from /lib64/tls/libc.so.6
#2  0x00000000004069af in catch_abort (sig=11) at mom_main.c:2791
#3  <signal handler called>
#4  0x00000030113688c5 in free () from /lib64/tls/libc.so.6
#5  0x000000000040d0f1 in arrayfree (array=0x96acf0) at mom_comm.c:984
#6  0x00000000004129d7 in tm_request (fd=11, version=1) at mom_comm.c:4463
#7  0x00000000004094f7 in do_tcp (fd=11) at mom_main.c:4550
#8  0x000000000040968c in tcp_request (fd=11) at mom_main.c:4634
#9  0x00000000004303aa in wait_request (waittime=1, SState=0x0) at net_server.c:325
#10 0x0000000000409969 in finish_loop (waittime=10) at mom_main.c:4776
#11 0x000000000040bafb in main (argc=1, argv=0x7fbffffde8) at mom_main.c:5860

And the last 10 lines of mom_logs for this node:

02/13/2006 13:27:06;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at admin2, sock=11
02/13/2006 13:27:40;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 18878, sid 31833, cmd /bin/sh
02/13/2006 13:27:40;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 18879, sid 31834, cmd /bin/sh
02/13/2006 13:28:35;0008;   pbs_mom;Job;58469.admin2;kill_task: killing pid 31833 task 18878 with sig 9
02/13/2006 13:28:40;0008;   pbs_mom;Job;58469.admin2;kill_task: killing pid 31834 task 18879 with sig 9
02/13/2006 13:28:40;0080;   pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18878 terminated, sid 31833
02/13/2006 13:28:40;0080;   pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18879 terminated, sid 31834
02/13/2006 13:28:49;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 20594, sid 31902, cmd /bin/sh
02/13/2006 13:28:49;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 20595, sid 31903, cmd /bin/sh
02/13/2006 13:28:49;0001;   pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in mom_main, Caught fatal core signal

I think the "Resource temporarily unavailable" message is misleading: 11 is the number of the signal pbs_mom received (SIGSEGV), not an errno value, but it is being reported as if it were errno 11 (EAGAIN).  I can work on some changes to make this clearer.
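
A minimal sketch of one way such a change could look, assuming a C library that provides strsignal() (glibc does).  This is not TORQUE code: format_fatal_signal is a hypothetical helper name, and the message wording is only modeled on the log line above.  The idea is to describe the number as a signal instead of handing it to a path that interprets it as an errno.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* make sure strsignal() is declared */
#endif
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of TORQUE): build a log message for a
 * fatal signal.  The signal number is described with strsignal(), so it
 * is never looked up in the errno table.
 * Returns the value of snprintf(): the length the full message needs.
 */
int format_fatal_signal(int sig, char *buf, size_t len)
{
    const char *desc;

    assert(buf != NULL && len > 0);

    desc = strsignal(sig);     /* e.g. a description of SIGSEGV */
    if (desc == NULL)
        desc = "unknown signal";

    return snprintf(buf, len, "Caught fatal core signal %d (%s)", sig, desc);
}
```

With this approach, a call such as format_fatal_signal(SIGSEGV, buf, sizeof buf) produces a message that names the signal; the exact description text comes from the C library.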

In any event, the main issue is that the segfault happened in the first place: per the backtrace, it occurred inside free(), called from arrayfree() while tm_request was being handled.  I will start looking through the code in more detail, but I thought I'd send this in case anyone has already seen it and knows of a fix.

Thank you,
-Marcus Epperson



