[torqueusers] Segfault in pbs_sched in next_job()

David Backeberg david.backeberg at case.edu
Thu Sep 20 19:52:13 MDT 2007


Have you considered Moab or Maui?

People generally choose one of these rather than pbs_sched. Among
other things, both schedulers are highly configurable, scale nicely,
and are heavily tested.

-Dave

On 9/20/07, Joshua Bernstein <jbernstein at penguincomputing.com> wrote:
> Hello All,
>
>         While working at a customer's site today, I ran into an issue where
> pbs_sched would segfault and die after a relatively short period of
> time. A large number of jobs are queued, and several jobs run
> nicely until pbs_sched segfaults. I was running TORQUE 2.1.8, and
> after the segfault I upgraded pbs_sched to version 2.1.9, but I still
> see the same behavior.
>
>         According to the backtrace below, it occurs in last_job++, which
> simply bumps an integer. I'm curious how simply bumping an int could
> cause a segfault. last_job is an integer offset into an array, but
> even if it were an array out-of-bounds error, I don't think the
> segfault would occur until the array was accessed.
>
>         Below is a backtrace from GDB, and I've attached the core file. It's
> about 920k or so. I'd appreciate any help with this:
>
> ---SNIP---
> [root@scyld sched_priv]# gdb pbs_sched core.13973
> GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and
> you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for
> details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host
> libthread_db library "/lib64/tls/libthread_db.so.1".
>
> Core was generated by `pbs_sched'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/lib64/libtorque.so.0...done.
> Loaded symbols for /usr/lib64/libtorque.so.0
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /lib64/libnss_files.so.2...done.
> Loaded symbols for /lib64/libnss_files.so.2
> Reading symbols from /lib64/libnss_beo.so.2...done.
> Loaded symbols for /lib64/libnss_beo.so.2
> Reading symbols from /usr/lib64/libbeoconfig.so.0...done.
> Loaded symbols for /usr/lib64/libbeoconfig.so.0
> Reading symbols from /lib64/libnss_bproc.so.2...done.
> Loaded symbols for /lib64/libnss_bproc.so.2
> Reading symbols from /usr/lib64/libbproc.so.2...done.
> Loaded symbols for /usr/lib64/libbproc.so.2
> #0  0x0000000000404cfd in next_job (sinfo=0x646c61636f6c2e64, init=0)
> at fifo.c:689
> 689           last_job++;
> (gdb) bt
> #0  0x0000000000404cfd in next_job (sinfo=0x646c61636f6c2e64, init=0)
> at fifo.c:689
> #1  0x00000000004052de in scheduling_cycle (sd=1767992687) at fifo.c:432
> #2  0x000000000040455d in main (argc=Variable "argc" is not available.
> ) at pbs_sched.c:1036
> ---END SNIP---
>
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
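
One observation on the backtrace quoted above, offered as a guess rather
than a diagnosis: the argument values in frames #0 and #1 do not look like
a normal pointer or descriptor. The sketch below is plain Python and not
part of TORQUE, and as_ascii is a hypothetical helper name. It reinterprets
each value as little-endian bytes, which is the byte order on x86_64, and
prints any printable ASCII it finds. The widths (8 bytes for the sinfo
pointer, 4 bytes for the int sd) are assumptions based on the platform.

```python
def as_ascii(value: int, width: int) -> str:
    """Render an integer as `width` little-endian bytes of printable ASCII.

    Bytes outside the printable range are shown as '?'.
    """
    raw = value.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in raw)


# Values copied from the GDB backtrace in the quoted message.
sinfo = 0x646C61636F6C2E64  # next_job() argument in frame #0
sd = 1767992687             # scheduling_cycle() argument in frame #1

print(as_ascii(sinfo, 8))   # prints: d.locald
print(as_ascii(sd, 4))      # prints: omai
```

Both values decode to text that looks like pieces of a hostname such as
"something.localdomain". If that reading is right, one possible explanation
is that a string overwrote memory holding these variables, in which case
the line GDB reports (last_job++) may simply be where the damage first
became visible rather than where it was caused. The values alone do not
confirm this.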

