[torquedev] pbs_mom segfault - 2.0.0p7
Garrick Staples
garrick at usc.edu
Fri Feb 24 12:33:41 MST 2006
Which version of TORQUE is this? There have been some changes in the
last few months that could affect this.
I'm guessing that an alloc in the TM_SPAWN part of tm_request failed but
wasn't caught.
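
To illustrate the kind of failure I mean, here is a minimal sketch, not the actual mom_comm.c code. The helper names dup_str_array and free_str_array are hypothetical. It assumes the array is a NULL-terminated list of malloc'd strings, which is what an arrayfree-style cleanup would need. Every allocation is checked, and the array is zeroed up front so that cleanup after a partial failure never passes an uninitialized pointer to free().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not TORQUE code: free a NULL-terminated array of
 * malloc'd strings. Safe to call with NULL or a partially filled array. */
static void free_str_array(char **array)
{
    if (array == NULL)
        return;
    for (char **p = array; *p != NULL; p++)
        free(*p);
    free(array);
}

/* Hypothetical helper: copy the first n strings of src into a new
 * NULL-terminated array. Returns NULL if any allocation fails. */
static char **dup_str_array(char *const *src, size_t n)
{
    /* calloc zeroes every slot, so the array stays NULL-terminated
     * even if we bail out halfway through filling it. */
    char **copy = calloc(n + 1, sizeof(*copy));
    if (copy == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(src[i]) + 1;
        copy[i] = malloc(len);
        if (copy[i] == NULL) {
            /* Unchecked, this failure would leave a hole that a later
             * cleanup pass could trip over. */
            free_str_array(copy);
            return NULL;
        }
        memcpy(copy[i], src[i], len);
    }
    return copy;
}
```

If the caller ignores a NULL return, or the array is built without a terminator, the later cleanup is where the crash would show up, which would match a fault inside free().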
On Mon, Feb 13, 2006 at 06:38:06PM -0700, Marcus R. Epperson alleged:
> Today we saw a rank 0 pbs_mom segfault on a large job. Here is the
> backtrace:
>
> #0 0x000000301132e37d in raise () from /lib64/tls/libc.so.6
> #1 0x000000301132faae in abort () from /lib64/tls/libc.so.6
> #2 0x00000000004069af in catch_abort (sig=11) at mom_main.c:2791
> #3 <signal handler called>
> #4 0x00000030113688c5 in free () from /lib64/tls/libc.so.6
> #5 0x000000000040d0f1 in arrayfree (array=0x96acf0) at mom_comm.c:984
> #6 0x00000000004129d7 in tm_request (fd=11, version=1) at mom_comm.c:4463
> #7 0x00000000004094f7 in do_tcp (fd=11) at mom_main.c:4550
> #8 0x000000000040968c in tcp_request (fd=11) at mom_main.c:4634
> #9 0x00000000004303aa in wait_request (waittime=1, SState=0x0) at net_server.c:325
> #10 0x0000000000409969 in finish_loop (waittime=10) at mom_main.c:4776
> #11 0x000000000040bafb in main (argc=1, argv=0x7fbffffde8) at mom_main.c:5860
>
> And the last 10 lines of mom_logs for this node:
>
> 02/13/2006 13:27:06;0100; pbs_mom;Req;;Type StatusJob request received from PBS_Server at admin2, sock=11
> 02/13/2006 13:27:40;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 18878, sid 31833, cmd /bin/sh
> 02/13/2006 13:27:40;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 18879, sid 31834, cmd /bin/sh
> 02/13/2006 13:28:35;0008; pbs_mom;Job;58469.admin2;kill_task: killing pid 31833 task 18878 with sig 9
> 02/13/2006 13:28:40;0008; pbs_mom;Job;58469.admin2;kill_task: killing pid 31834 task 18879 with sig 9
> 02/13/2006 13:28:40;0080; pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18878 terminated, sid 31833
> 02/13/2006 13:28:40;0080; pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18879 terminated, sid 31834
> 02/13/2006 13:28:49;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 20594, sid 31902, cmd /bin/sh
> 02/13/2006 13:28:49;0008; pbs_mom;Job;58469.admin2;start_process: task started, tid 20595, sid 31903, cmd /bin/sh
> 02/13/2006 13:28:49;0001; pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in mom_main, Caught fatal core signal
>
> I think the "Resource temporarily unavailable" message is misleading since
> 11 is the signal it received, not an errno. I can work on some changes
> that will make it more clear.
>
> In any event, the main issue is the fact that the segfault happened in the
> first place. I will start looking through the code in more detail, but I
> thought I'd send this in case anyone has seen this already and knows of a
> fix.
>
> Thank you,
> -Marcus Epperson
>
>
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
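
On the misleading log message quoted above: on Linux, errno 11 is EAGAIN, so passing a signal number through an errno-style lookup yields "Resource temporarily unavailable" for what is really SIGSEGV. A rough sketch of describing the signal with strsignal() instead follows. The function name describe_fatal_signal is hypothetical and this is not how mom_main.c is written; it only shows the idea.

```c
#define _POSIX_C_SOURCE 200809L  /* for strsignal() under strict modes */
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not TORQUE code: build a log line for a caught
 * fatal signal from the signal's own description, rather than treating
 * the signal number as an errno value. */
static void describe_fatal_signal(int sig, char *buf, size_t len)
{
    const char *desc = strsignal(sig);

    snprintf(buf, len, "Caught fatal core signal %d (%s)",
             sig, desc != NULL ? desc : "unknown signal");
}
```

A real handler would also have to consider that snprintf() and strsignal() are not on the async-signal-safe list, so the formatting is better done outside the handler or with a precomputed table.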
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California