[torquedev] pbs_mom segfault - 2.0.0p7

Garrick Staples garrick at usc.edu
Fri Feb 24 12:33:41 MST 2006


Which version of TORQUE is this?  There have been some changes in the
last few months that could affect this.

I'm guessing that an alloc in the TM_SPAWN part of tm_request failed but
wasn't caught.
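
To illustrate the kind of failure I mean, here is a minimal sketch in C. It
is not the mom_comm.c code: copy_str_array and free_str_array are
hypothetical names, and the layout (a NULL-terminated array of
heap-allocated strings) is only my assumption about what arrayfree walks.
The point is that every slot is initialized before anything can fail, and
every allocation is checked, so the cleanup path never passes a garbage
pointer to free().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the TORQUE source: free a NULL-terminated
 * array of heap-allocated strings.  Safe to call with NULL. */
void free_str_array(char **array)
{
    size_t i;

    if (array == NULL)
        return;
    for (i = 0; array[i] != NULL; i++)
        free(array[i]);
    free(array);
}

/* Hypothetical helper: deep-copy the first n strings of src into a new
 * NULL-terminated array.  Returns NULL if any allocation fails. */
char **copy_str_array(char *const *src, size_t n)
{
    char **dst;
    size_t i;

    dst = malloc((n + 1) * sizeof(*dst));
    if (dst == NULL)
        return NULL;

    /* Set every slot to NULL first, so a partially filled array is
     * always terminated and can be freed safely. */
    for (i = 0; i <= n; i++)
        dst[i] = NULL;

    for (i = 0; i < n; i++) {
        size_t len = strlen(src[i]) + 1;

        dst[i] = malloc(len);
        if (dst[i] == NULL) {
            /* Checked failure: release what was built so far. */
            free_str_array(dst);
            return NULL;
        }
        memcpy(dst[i], src[i], len);
    }
    return dst;
}
```

A caller would treat a NULL return from copy_str_array as an error to
report, rather than carrying on and freeing the array later.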

On Mon, Feb 13, 2006 at 06:38:06PM -0700, Marcus R. Epperson alleged:
> Today we saw a rank 0 pbs_mom segfault on a large job.  Here is the 
> backtrace:
> 
> #0  0x000000301132e37d in raise () from /lib64/tls/libc.so.6
> #1  0x000000301132faae in abort () from /lib64/tls/libc.so.6
> #2  0x00000000004069af in catch_abort (sig=11) at mom_main.c:2791
> #3  <signal handler called>
> #4  0x00000030113688c5 in free () from /lib64/tls/libc.so.6
> #5  0x000000000040d0f1 in arrayfree (array=0x96acf0) at mom_comm.c:984
> #6  0x00000000004129d7 in tm_request (fd=11, version=1) at mom_comm.c:4463
> #7  0x00000000004094f7 in do_tcp (fd=11) at mom_main.c:4550
> #8  0x000000000040968c in tcp_request (fd=11) at mom_main.c:4634
> #9  0x00000000004303aa in wait_request (waittime=1, SState=0x0) at net_server.c:325
> #10 0x0000000000409969 in finish_loop (waittime=10) at mom_main.c:4776
> #11 0x000000000040bafb in main (argc=1, argv=0x7fbffffde8) at mom_main.c:5860
> 
> And the last 10 lines of mom_logs for this node:
> 
> 02/13/2006 13:27:06;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at admin2, sock=11
> 02/13/2006 13:27:40;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 18878, sid 31833, cmd /bin/sh
> 02/13/2006 13:27:40;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 18879, sid 31834, cmd /bin/sh
> 02/13/2006 13:28:35;0008;   pbs_mom;Job;58469.admin2;kill_task: killing pid 31833 task 18878 with sig 9
> 02/13/2006 13:28:40;0008;   pbs_mom;Job;58469.admin2;kill_task: killing pid 31834 task 18879 with sig 9
> 02/13/2006 13:28:40;0080;   pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18878 terminated, sid 31833
> 02/13/2006 13:28:40;0080;   pbs_mom;Job;58469.admin2;scan_for_terminated: job 58469.admin2 task 18879 terminated, sid 31834
> 02/13/2006 13:28:49;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 20594, sid 31902, cmd /bin/sh
> 02/13/2006 13:28:49;0008;   pbs_mom;Job;58469.admin2;start_process: task started, tid 20595, sid 31903, cmd /bin/sh
> 02/13/2006 13:28:49;0001;   pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in mom_main, Caught fatal core signal
> 
> I think the "Resource temporarily unavailable" message is misleading since 
> 11 is the signal it received, not an errno.  I can work on some changes 
> that will make it more clear.
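
The mix-up described above can be sketched in C as follows. This is an
illustration only, not the mom_main.c code: errno_text_for,
fatal_signal_name and format_fatal_signal are hypothetical names. On Linux,
errno 11 is EAGAIN, whose text is "Resource temporarily unavailable", while
signal 11 is SIGSEGV, so feeding a signal number to strerror() yields an
unrelated message.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* The misleading path: strerror() interprets its argument as an errno
 * value.  Passing a signal number here is the mistake being shown. */
const char *errno_text_for(int value)
{
    return strerror(value);
}

/* Hypothetical helper: names for fatal signals defined by ISO C. */
const char *fatal_signal_name(int sig)
{
    switch (sig) {
    case SIGSEGV: return "SIGSEGV";
    case SIGABRT: return "SIGABRT";
    case SIGFPE:  return "SIGFPE";
    case SIGILL:  return "SIGILL";
    default:      return "unknown signal";
    }
}

/* Hypothetical helper: build a log message that treats the number as a
 * signal.  Returns what snprintf() returns. */
int format_fatal_signal(char *buf, size_t len, int sig)
{
    return snprintf(buf, len, "caught fatal signal %d (%s)",
                    sig, fatal_signal_name(sig));
}
```

With something along these lines, the last log entry would name the signal
instead of printing an errno string for the same number.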
> 
> In any event, the main issue is the fact that the segfault happened in the 
> first place.  I will start looking through the code in more detail, but I 
> thought I'd send this in case anyone has seen this already and knows of a 
> fix.
> 
> Thank you,
> -Marcus Epperson

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California