Bugzilla – Bug 36
qsub crashes with long dependency list
Last modified: 2009-12-08 16:22:12 MST
qsub consistently crashes when a job is submitted with a job dependency list between 62 and 75 jobs long. This could be caused by the job count or by the size of the dependency string; I'm not sure. All the job IDs are 5 digits long, so the limit in terms of characters in the list would be 372 to 450 characters (5 digits per job ID plus one for every colon), not counting "afterany". Fewer than 62 jobs works fine, and more than 75 jobs produces an "illegal -W value" error from qsub.

Torque version is 2.3.6, OS is CentOS 5.3, and Torque was installed as part of the Maui/Torque Roll made by the University of Tromso for Rocks 5.2.

This series of bash commands will reproduce the error if the for loop is modified to create between 62 and 75 initial jobs and those job IDs are five digits long.

JIDS=""
for (( i=1;i<=62;i++ )); do JID=`echo sleep 15 | qsub`; JIDS="$JIDS:${JID/.*/}"; done
echo $JIDS
JID=`echo date | qsub -W depend=afterany$JIDS`
echo ${JID/.*/}

Below is the error message in question.

*** glibc detected *** qsub: double free or corruption (out): 0x0000000016e199f0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x323d871ce2]
/lib64/libc.so.6(cfree+0x8c)[0x323d87590c]
/opt/torque/lib64/libtorque.so.2(parse_depend_list+0x101)[0x2b505d7414a1]
qsub[0x403b90]
qsub[0x40684f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x323d81d974]
qsub[0x402429]
======= Memory map: ========
00400000-0040a000 r-xp 00000000 08:01 7562691 /opt/torque/bin/qsub
00609000-0060b000 rw-p 00009000 08:01 7562691 /opt/torque/bin/qsub
0060b000-0060c000 rw-p 0060b000 00:00 0
16e19000-16e3a000 rw-p 16e19000 00:00 0 [heap]
323d400000-323d41c000 r-xp 00000000 08:01 6704622 /lib64/ld-2.5.so
323d61b000-323d61c000 r--p 0001b000 08:01 6704622 /lib64/ld-2.5.so
323d61c000-323d61d000 rw-p 0001c000 08:01 6704622 /lib64/ld-2.5.so
323d800000-323d94c000 r-xp 00000000 08:01 6704623 /lib64/libc-2.5.so
323d94c000-323db4c000 ---p 0014c000 08:01 6704623 /lib64/libc-2.5.so
323db4c000-323db50000 r--p 0014c000 08:01 6704623 /lib64/libc-2.5.so
323db50000-323db51000 rw-p 00150000 08:01 6704623 /lib64/libc-2.5.so
323db51000-323db56000 rw-p 323db51000 00:00 0
324fe00000-324fe0d000 r-xp 00000000 08:01 6704530 /lib64/libgcc_s-4.1.2-20080825.so.1
324fe0d000-325000d000 ---p 0000d000 08:01 6704530 /lib64/libgcc_s-4.1.2-20080825.so.1
325000d000-325000e000 rw-p 0000d000 08:01 6704530 /lib64/libgcc_s-4.1.2-20080825.so.1
2b505d708000-2b505d709000 rw-p 2b505d708000 00:00 0
2b505d725000-2b505d726000 rw-p 2b505d725000 00:00 0
2b505d726000-2b505d74f000 r-xp 00000000 08:01 7562709 /opt/torque/lib64/libtorque.so.2.0.0
2b505d74f000-2b505d94f000 ---p 00029000 08:01 7562709 /opt/torque/lib64/libtorque.so.2.0.0
2b505d94f000-2b505d951000 rw-p 00029000 08:01 7562709 /opt/torque/lib64/libtorque.so.2.0.0
2b505d951000-2b505d97f000 rw-p 2b505d951000 00:00 0
2b505d97f000-2b505d989000 r-xp 00000000 08:01 6704347 /lib64/libnss_files-2.5.so
2b505d989000-2b505db88000 ---p 0000a000 08:01 6704347 /lib64/libnss_files-2.5.so
2b505db88000-2b505db89000 r--p 00009000 08:01 6704347 /lib64/libnss_files-2.5.so
2b505db89000-2b505db8a000 rw-p 0000a000 08:01 6704347 /lib64/libnss_files-2.5.so
2b5060000000-2b5060021000 rw-p 2b5060000000 00:00 0
2b5060021000-2b5064000000 ---p 2b5060021000 00:00 0
7fff4aa43000-7fff4aa65000 rw-p 7ffffffdd000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
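The character-count estimate in the report (372 to 450 characters for 62 to 75 five-digit job IDs) can be checked without submitting any jobs. The following is only a sketch: the helper name `build_depend_list` and the starting ID 16501 are assumptions chosen for illustration, and the helper builds the same ":ID:ID:..." string that the reproduction loop accumulates in JIDS.

```shell
# Hypothetical helper (not part of Torque): build a ":ID:ID:..." list of
# consecutive job IDs, in the same form the reproduction loop stores in JIDS.
# $1 = number of jobs, $2 = first job ID (assumed default: 16501)
build_depend_list() {
    n=$1
    first=${2:-16501}
    list=""
    i=0
    while [ "$i" -lt "$n" ]; do
        list="$list:$((first + i))"
        i=$((i + 1))
    done
    printf '%s' "$list"
}

# Length of the list, not counting the "afterany" keyword.
short=$(build_depend_list 62)
long=$(build_depend_list 75)
echo "62 jobs: ${#short} characters"   # 62 jobs: 372 characters
echo "75 jobs: ${#long} characters"    # 75 jobs: 450 characters
```

Each entry contributes six characters (one colon plus five digits), which is where the 372 and 450 figures come from.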
Can you provide a backtrace with gdb?
(In reply to comment #1)
> can you provide a backtrace with gdb?

Yes. Here is the gdb session output with the backtrace at the end. This run used 66 five-digit job dependencies.

gdb qsub
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
(gdb) run qsub -W depend=afterany:16501:16502:16503:16504:16505:16506:16507:16508:16509:16510:16511:16512:16513:16514:16515:16516:16517:16518:16519:16520:16521:16522:16523:16524:16525:16526:16527:16528:16529:16530:16531:16532:16533:16534:16535:16536:16537:16538:16539:16540:16541:16542:16543:16544:16545:16546:16547:16548:16549:16550:16551:16552:16553:16554:16555:16556:16557:16558:16559:16560:16561:16562:16563:16564:16565:16566 date.sh
Starting program: /opt/torque/bin/qsub qsub -W depend=afterany:16501:16502:16503:16504:16505:16506:16507:16508:16509:16510:16511:16512:16513:16514:16515:16516:16517:16518:16519:16520:16521:16522:16523:16524:16525:16526:16527:16528:16529:16530:16531:16532:16533:16534:16535:16536:16537:16538:16539:16540:16541:16542:16543:16544:16545:16546:16547:16548:16549:16550:16551:16552:16553:16554:16555:16556:16557:16558:16559:16560:16561:16562:16563:16564:16565:16566 date.sh
(no debugging symbols found)
*** glibc detected *** /opt/torque/bin/qsub: munmap_chunk(): invalid pointer: 0x000000001d4869d0 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x1b6)[0x323d875a36]
/opt/torque/lib64/libtorque.so.2(parse_depend_list+0x101)[0x2ae34ced84a1]
/opt/torque/bin/qsub[0x403b90]
/opt/torque/bin/qsub[0x40684f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x323d81d974]
/opt/torque/bin/qsub[0x402429]
======= Memory map: ========
00400000-0040a000 r-xp 00000000 08:01 7562691 /opt/torque/bin/qsub
00609000-0060b000 rw-p 00009000 08:01 7562691 /opt/torque/bin/qsub
0060b000-0060c000 rw-p 0060b000 00:00 0
1d486000-1d4a7000 rw-p 1d486000 00:00 0 [heap]
323d400000-323d41c000 r-xp 00000000 08:01 6704622 /lib64/ld-2.5.so
323d61b000-323d61c000 r--p 0001b000 08:01 6704622 /lib64/ld-2.5.so
323d61c000-323d61d000 rw-p 0001c000 08:01 6704622 /lib64/ld-2.5.so
323d800000-323d94c000 r-xp 00000000 08:01 6704623 /lib64/libc-2.5.so
323d94c000-323db4c000 ---p 0014c000 08:01 6704623 /lib64/libc-2.5.so
323db4c000-323db50000 r--p 0014c000 08:01 6704623 /lib64/libc-2.5.so
323db50000-323db51000 rw-p 00150000 08:01 6704623 /lib64/libc-2.5.so
323db51000-323db56000 rw-p 323db51000 00:00 0
324fe00000-324fe0d000 r-xp 00000000 08:01 6704530 /lib64/libgcc_s-4.1.2-20080825.so.1
324fe0d000-325000d000 ---p 0000d000 08:01 6704530 /lib64/libgcc_s-4.1.2-20080825.so.1
325000d000-325000e000 rw-p 0000d000 08:01 6704530 /lib64/libgcc_s-4.1.2-20080825.so.1
2ae34ce9f000-2ae34cea0000 rw-p 2ae34ce9f000 00:00 0
2ae34cebc000-2ae34cebd000 rw-p 2ae34cebc000 00:00 0
2ae34cebd000-2ae34cee6000 r-xp 00000000 08:01 7562709 /opt/torque/lib64/libtorque.so.2.0.0
2ae34cee6000-2ae34d0e6000 ---p 00029000 08:01 7562709 /opt/torque/lib64/libtorque.so.2.0.0
2ae34d0e6000-2ae34d0e8000 rw-p 00029000 08:01 7562709 /opt/torque/lib64/libtorque.so.2.0.0
2ae34d0e8000-2ae34d116000 rw-p 2ae34d0e8000 00:00 0
2ae34d116000-2ae34d120000 r-xp 00000000 08:01 6704347 /lib64/libnss_files-2.5.so
2ae34d120000-2ae34d31f000 ---p 0000a000 08:01 6704347 /lib64/libnss_files-2.5.so
2ae34d31f000-2ae34d320000 r--p 00009000 08:01 6704347 /lib64/libnss_files-2.5.so
2ae34d320000-2ae34d321000 rw-p 0000a000 08:01 6704347 /lib64/libnss_files-2.5.so
7fffd613e000-7fffd615f000 rw-p 7ffffffde000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]

Program received signal SIGABRT, Aborted.
0x000000323d830215 in raise () from /lib64/libc.so.6
(gdb) backtrace
#0 0x000000323d830215 in raise () from /lib64/libc.so.6
#1 0x000000323d831cc0 in abort () from /lib64/libc.so.6
#2 0x000000323d86a7fb in __libc_message () from /lib64/libc.so.6
#3 0x000000323d875a36 in free () from /lib64/libc.so.6
#4 0x00002ae34ced84a1 in parse_depend_list (list=<value optimized out>, rtn_list=0x1d4861d0 "afterany:16501.bloom.ocean.washington.edu:16502.bloom.ocean.washington.edu:16503.bloom.ocean.washington.edu:16504.bloom.ocean.washington.edu:16505.bloom.ocean.washington.edu:16506.bloom.ocean.washingt"..., rtn_size=2040) at ../Libcmds/parse_depend.c:270
#5 0x0000000000403b90 in process_opts ()
#6 0x000000000040684f in main ()
(gdb)
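Frame #4 of the gdb backtrace shows two details worth putting side by side: rtn_list holds job IDs expanded with a server suffix (16501.bloom.ocean.washington.edu), and rtn_size is 2040. The sketch below only does arithmetic on those two visible values. It assumes every job ID is expanded with the same suffix (the gdb output is truncated after the sixth entry), and the helper name `expanded_len` is hypothetical. It does not confirm the root cause, and it does not explain the "illegal -W value" behavior above 75 jobs.

```shell
# Values copied from frame #4 of the gdb backtrace.
suffix=".bloom.ocean.washington.edu"   # server suffix appended to each job ID
rtn_size=2040                          # rtn_size argument of parse_depend_list

# Hypothetical helper: length of the expanded dependency string for
# $1 five-digit job IDs, assuming every ID gets the same suffix.
expanded_len() {
    keyword="afterany"
    per_job=$(( 1 + 5 + ${#suffix} ))   # colon + five digits + suffix
    echo $(( ${#keyword} + $1 * per_job ))
}

for n in 61 62 66; do
    echo "$n jobs: $(expanded_len "$n") characters (rtn_size=$rtn_size)"
done
# 61 jobs: 2021 characters (rtn_size=2040)
# 62 jobs: 2054 characters (rtn_size=2040)
# 66 jobs: 2186 characters (rtn_size=2040)
```

Under these assumptions the expanded string first exceeds 2040 characters at 62 jobs, which coincides with the lower bound reported for the crash. This is a consistency check on the reported numbers, not a diagnosis.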
Changes to correct this were checked into the 2.3, 2.4 and 2.5 branches and will be available in future snapshots and releases.