[torqueusers] Segfault in pbs_sched in next_job()

Joshua Bernstein jbernstein at penguincomputing.com
Thu Sep 20 00:57:25 MDT 2007


Hello All,

	While working at a customer's site today, I ran into an issue where
pbs_sched would segfault and die after a relatively short period of
time. A large number of jobs are queued, and several jobs run
nicely until pbs_sched segfaults. I was running TORQUE 2.1.8, and
after the segfault I upgraded pbs_sched to version 2.1.9, but I still
see the same behavior.

	According to the backtrace below, the segfault occurs at last_job++,
which simply bumps an integer. I'm curious how simply bumping an int
could cause a segfault. last_job is an integer offset into an array, but
even if it were an out-of-bounds array error, I don't think the
segfault would occur until the array was accessed.

	Below is a backtrace from GDB, and I've attached the core file. It's
about 920k or so. I'd appreciate any help with this:

---SNIP---
[root at scyld sched_priv]# gdb pbs_sched core.13973
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Core was generated by `pbs_sched'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib64/libtorque.so.0...done.
Loaded symbols for /usr/lib64/libtorque.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_beo.so.2...done.
Loaded symbols for /lib64/libnss_beo.so.2
Reading symbols from /usr/lib64/libbeoconfig.so.0...done.
Loaded symbols for /usr/lib64/libbeoconfig.so.0
Reading symbols from /lib64/libnss_bproc.so.2...done.
Loaded symbols for /lib64/libnss_bproc.so.2
Reading symbols from /usr/lib64/libbproc.so.2...done.
Loaded symbols for /usr/lib64/libbproc.so.2
#0  0x0000000000404cfd in next_job (sinfo=0x646c61636f6c2e64, init=0) at fifo.c:689
689           last_job++;
(gdb) bt
#0  0x0000000000404cfd in next_job (sinfo=0x646c61636f6c2e64, init=0) at fifo.c:689
#1  0x00000000004052de in scheduling_cycle (sd=1767992687) at fifo.c:432
#2  0x000000000040455d in main (argc=Variable "argc" is not available.
) at pbs_sched.c:1036
---END SNIP---

-Joshua Bernstein
Software Engineer
Penguin Computing

