[torqueusers] pbs_scheduler keeps dying

gianfranco sciacca gs at hep.ucl.ac.uk
Tue Nov 8 05:08:25 MST 2005


On Tue, 2005-10-18 at 19:13, Garrick Staples wrote:
> On Tue, Oct 18, 2005 at 03:58:35PM +0100, gianfranco sciacca alleged:
> > We have been running torque with its stock scheduler for about 7 months 
> > with few problems. For the past few days, however, the scheduler 
> > keeps dying, which is seriously disrupting our cluster operation. I should 
> 
> Can you get a gdb backtrace of it dying? Or maybe run it under
> valgrind?

This is the first crash under valgrind. I append the log below.

cheers,
gianfranco

==2411== Memcheck, a memory error detector.
==2411== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
==2411== Using LibVEX rev 1367, a library for dynamic binary translation.
==2411== Copyright (C) 2004-2005, and GNU GPL'd, by OpenWorks LLP.
==2411== Using valgrind-3.0.1, a dynamic binary instrumentation framework.
==2411== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==2411==
==2411== My PID = 2411, parent PID = 2410.  Prog and args are:
==2411==    /usr/pbs/sbin/pbs_sched
==2411== For more details, rerun with: -v
==2411==
==2411==
==2411== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 26 from 1)
==2411== malloc/free: in use at exit: 3862 bytes in 43 blocks.
==2411== malloc/free: 153 allocs, 110 frees, 104332 bytes allocated.
==2411== For counts of detected errors, rerun with: -v
==2411== searching for pointers to 43 not-freed blocks.
==2411== checked 110340 bytes.
==2411==
==2411== LEAK SUMMARY:
==2411==    definitely lost: 0 bytes in 0 blocks.
==2411==      possibly lost: 0 bytes in 0 blocks.
==2411==    still reachable: 3862 bytes in 43 blocks.
==2411==         suppressed: 0 bytes in 0 blocks.
==2411== Reachable blocks (those to which a pointer was found) are not shown.
==2411== To see them, rerun with: --show-reachable=yes
==2413== Invalid read of size 4
==2413==    at 0x805A6FB: pbs_rescquery (pbsD_resc.c:207)
==2413==    by 0x8053B0C: check_nodes (check.c:484)
==2413==    by 0x805364E: is_ok_to_run_job (check.c:174)
==2413==    by 0x804BE3C: scheduling_cycle (fifo.c:412)
==2413==    by 0x804BC4D: schedule (fifo.c:346)
==2413==    by 0x804B5B2: main (pbs_sched.c:1007)
==2413==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2413==
==2413== Process terminating with default action of signal 11 (SIGSEGV)
==2413==  Access not within mapped region at address 0x0
==2413==    at 0x805A6FB: pbs_rescquery (pbsD_resc.c:207)
==2413==    by 0x8053B0C: check_nodes (check.c:484)
==2413==    by 0x805364E: is_ok_to_run_job (check.c:174)
==2413==    by 0x804BE3C: scheduling_cycle (fifo.c:412)
==2413==    by 0x804BC4D: schedule (fifo.c:346)
==2413==    by 0x804B5B2: main (pbs_sched.c:1007)
==2413==
==2413== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 26 from 1)
==2413== malloc/free: in use at exit: 183861 bytes in 4495 blocks.
==2413== malloc/free: 110522960 allocs, 110518465 frees, 546853114 bytes allocated.
==2413== For counts of detected errors, rerun with: -v
==2413== searching for pointers to 4495 not-freed blocks.
==2413== checked 274704 bytes.
==2413==
==2413== LEAK SUMMARY:
==2413==    definitely lost: 22066 bytes in 2636 blocks.
==2413==      possibly lost: 0 bytes in 0 blocks.
==2413==    still reachable: 161795 bytes in 1859 blocks.
==2413==         suppressed: 0 bytes in 0 blocks.
==2413== Use --leak-check=full to see details of leaked memory.
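
A note on reading the trace: "Invalid read of size 4" at "Address 0x0" is what a 4-byte load through a NULL pointer looks like, for example reading an int field at offset 0 of a struct whose pointer was never set. I have not checked what pbsD_resc.c:207 actually does, so the following is only a minimal sketch of that class of bug and of a defensive check. The struct and function names are hypothetical and are not taken from the TORQUE source.

```c
#include <stddef.h>

/* Hypothetical stand-in for some per-connection record.
 * This is NOT the real TORQUE data structure. */
struct conn_info {
    int stream;   /* 4-byte field at offset 0, as in "Invalid read of size 4" */
};

/* Copies the field into *out and returns 0 on success.
 * Returns -1 instead of crashing when either pointer is NULL. */
int read_stream(const struct conn_info *ci, int *out)
{
    if (ci == NULL || out == NULL)
        return -1;        /* without this check, ci->stream reads address 0x0 */
    *out = ci->stream;
    return 0;
}
```

A caller would treat a -1 return as "no usable record" and skip the query for this cycle rather than let the process die with SIGSEGV.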



