[torquedev] pbs_server segfault in req_delete.c

Joshua Bernstein jbernstein at penguincomputing.com
Tue Dec 23 16:47:38 MST 2008

Hello TORQUE Fans!

	Remember me? I figured I'd drop one more observed and repeatable 
crash before we all go on break for the holidays. This time, 
though, it seems to be inside of pbs_server. I'm running on X86_64, and 
I've been able to reproduce this problem in both version 2.3.3 and the 
shiny new 2.3.6.

	Essentially, if you issue a qdel -p to clear the queue of stale jobs, 
pbs_server appears to continue to operate normally, but some time after 
new jobs get submitted to the queue, pbs_server posts this message and dies:

Assertion failed, bad pointer in link: file "req_delete.c", line 844
Aborted (core dumped)

I've been able to reproduce this by submitting jobs (a simple echo 
"HELLO") out of a directory that isn't known to pbs_mom (i.e., something 
not listed in mom_priv/config). In my case I just use /tmp on the 
headnode. This causes the job to enter the "E", or exiting, state and 
thus hang around in the queue until the remote copy times out. While the 
jobs are still in that state, I issue the qdel:

$ qselect | xargs qdel -p

The key here seems to be that the qdel -p must be issued while there are 
jobs in the "E" state. Otherwise, I cannot seem to generate the crash.

After all of the jobs have been cleared from the queue, I submit a few 
more jobs. These jobs get queued and run without too much trouble, until 
pbs_server begins (correctly) reporting the stray jobs:

PBS_Server: sync_node_jobs, stray job 515.goldstar.penguincomputing.com 
found on n0
PBS_Server: sync_node_jobs, stray job 516.goldstar.penguincomputing.com 
found on n0
PBS_Server: sync_node_jobs, stray job 315.goldstar.penguincomputing.com 
found on n2
PBS_Server: sync_node_jobs, stray job 521.goldstar.penguincomputing.com 
found on n2

Shortly after this, say 5 minutes or so, pbs_server dies with:

Assertion failed, bad pointer in link: file "req_delete.c", line 844

So what does the backtrace look like?

#0  0x000000328262e25d in raise () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x000000328262e25d in raise () from /lib64/tls/libc.so.6
#1  0x000000328262fa5e in abort () from /lib64/tls/libc.so.6
#2  0x0000002a9559f64b in get_next (pl={ll_prior = 0x0, ll_next = 0x0, 
= 0x0}, file=0x440874 "req_delete.c", line=844)
     at ../Libifl/list_link.c:372
#3  0x000000000041b3c8 in remove_job_delete_nanny ()
#4  0x000000000041cf03 in on_job_exit ()
#5  0x000000000043447b in dispatch_task ()
#6  0x000000000040a245 in process_Dreply ()
#7  0x0000002a955a6ac8 in wait_request (waittime=24, SState=0x568c98) at
#8  0x0000000000417e57 in main ()
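
For anyone following along: frame #2 shows get_next() being handed a link whose ll_prior and ll_next are both 0x0, which is consistent with a checked list traversal refusing to follow a NULL pointer. Below is a minimal sketch of that kind of check. It is not TORQUE's actual code. The struct layout, the third field name (payload, since that field's name is not legible in the backtrace), and the helpers init_head, link_is_walkable and next_or_null are all hypothetical names for illustration; only ll_prior and ll_next come from the backtrace. The sketch reports the bad link instead of calling abort().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a doubly linked list link.
 * ll_prior and ll_next are taken from the backtrace; "payload" is an
 * assumed name for the pointer to the structure the link belongs to. */
typedef struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *payload;
} list_link;

/* Initialise a list head as an empty circular list: it points at itself. */
static void init_head(list_link *head)
{
    head->ll_prior = head;
    head->ll_next  = head;
    head->payload  = NULL;
}

/* Return 1 if ll_next can be followed, 0 if a checked traversal
 * would have to treat this link as a bad pointer. */
static int link_is_walkable(const list_link *pl)
{
    return pl->ll_next != NULL;
}

/* Checked traversal: on a bad link, set *ok to 0 and return NULL
 * rather than aborting; otherwise return the next entry's payload. */
static void *next_or_null(const list_link *pl, int *ok)
{
    if (!link_is_walkable(pl)) {
        *ok = 0;
        return NULL;
    }
    *ok = 1;
    return pl->ll_next->payload;
}
```

A link that was zeroed with memset() (or never initialised via something like init_head) fails link_is_walkable, which matches the pl={ll_prior = 0x0, ll_next = 0x0, ...} argument above. Whether that is what actually happens to the job after qdel -p is an assumption on my part and still needs to be confirmed in the source.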

I haven't had much time to dig through the code yet, and I'm less 
familiar with the pbs_server code than with pbs_mom. Looks like I'm going to 
have to learn quickly. Anyway, I figured I'd throw it out there for 
discussion. True, it seems like a weird use case, and something that 
rarely happens in practice, but it does seem like a real bug.

Just to summarize:

Version: 2.3.6 and 2.3.3
Arch:    X86_64
Configure: ./configure --prefix=%{torque_prefix} 
--with-server-home=%{serverspooldir} --libdir=%{_libdir} 
--disable-gcc-warnings --with-debug

-Joshua Bernstein
Software Engineer
Penguin Computing
