[torquedev] pbs_mom segfault - 2.0.0p7
Marcus R. Epperson
mrepper at sandia.gov
Mon Feb 13 18:38:06 MST 2006
Today we saw a rank 0 pbs_mom segfault on a large job. Here is the backtrace:
#0 0x000000301132e37d in raise () from /lib64/tls/libc.so.6
#1 0x000000301132faae in abort () from /lib64/tls/libc.so.6
#2 0x00000000004069af in catch_abort (sig=11) at mom_main.c:2791
#3 <signal handler called>
#4 0x00000030113688c5 in free () from /lib64/tls/libc.so.6
#5 0x000000000040d0f1 in arrayfree (array=0x96acf0) at mom_comm.c:984
#6 0x00000000004129d7 in tm_request (fd=11, version=1) at mom_comm.c:4463
#7 0x00000000004094f7 in do_tcp (fd=11) at mom_main.c:4550
#8 0x000000000040968c in tcp_request (fd=11) at mom_main.c:4634
#9 0x00000000004303aa in wait_request (waittime=1, SState=0x0) at net_server.c:325
#10 0x0000000000409969 in finish_loop (waittime=10) at mom_main.c:4776
#11 0x000000000040bafb in main (argc=1, argv=0x7fbffffde8) at mom_main.c:5860
And the last 10 lines of mom_logs for this node:
02/13/2006 13:27:06;0100; pbs_mom;Req;;Type StatusJob request received from PBS_Server at admin2, sock=11
02/13/2006 13:27:40;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 18878, sid 31833, cmd /bin/sh
02/13/2006 13:27:40;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 18879, sid 31834, cmd /bin/sh
02/13/2006 13:28:35;0008; pbs_mom;Job;58469.admin2;kill_task: killing pid 31833 task 18878 with sig 9
02/13/2006 13:28:40;0008; pbs_mom;Job;58469.admin2;kill_task: killing pid 31834 task 18879 with sig 9
02/13/2006 13:28:40;0080; pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18878 terminated, sid 31833
02/13/2006 13:28:40;0080; pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18879 terminated, sid 31834
02/13/2006 13:28:49;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 20594, sid 31902, cmd /bin/sh
02/13/2006 13:28:49;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 20595, sid 31903, cmd /bin/sh
02/13/2006 13:28:49;0001; pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in mom_main, Caught fatal core signal
I think the "Resource temporarily unavailable" message is misleading, since 11 is the number of the signal it received (SIGSEGV), not an errno. I can work on some changes to make that clearer.
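To illustrate the mix-up, here is a minimal sketch, not the actual mom_main.c code. It assumes the log line was produced by running the signal number through an errno-to-text lookup; on Linux errno 11 is EAGAIN, which strerror() renders as "Resource temporarily unavailable", while signal 11 is SIGSEGV. The two helper names below are hypothetical and exist only for this example; strsignal() and strerror() are the standard POSIX/C library calls.

```c
#define _POSIX_C_SOURCE 200809L  /* needed for strsignal() under strict -std= modes */
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: describe a number as if it were an errno.
 * This mimics the misleading log line, where 11 was looked up as an errno. */
int format_as_errno(int num, char *buf, size_t len)
{
    return snprintf(buf, len, "%s (%d)", strerror(num), num);
}

/* Hypothetical helper: describe the same number as a signal,
 * which is what a fatal-signal handler actually receives. */
int format_as_signal(int sig, char *buf, size_t len)
{
    const char *name = strsignal(sig);  /* e.g. "Segmentation fault" on glibc */
    return snprintf(buf, len, "caught signal %d (%s)",
                    sig, name != NULL ? name : "unknown signal");
}
```

Calling format_as_signal(SIGSEGV, buf, sizeof buf) from a reporting path would give a message that names the signal, whereas format_as_errno(11, ...) reproduces the confusing text from the log above. Whether the real fix belongs in catch_abort or in the logging call it uses is something I have not checked.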
In any event, the main issue is that the segfault happened in the first place. I will start looking through the code in more detail, but I thought I'd send this in case anyone has already seen it and knows of a fix.
Thank you,
-Marcus Epperson