Bug 36 - qsub crashes with long dependency list
Status: RESOLVED FIXED
Product: TORQUE
Component: clients
Version: 2.3.x
Platform: PC Linux
Importance: P5 normal
Assigned To: Al Taufer
Reported: 2009-11-30 12:02 MST by Chris Berthiaume
Modified: 2009-12-08 16:22 MST (History)

Description Chris Berthiaume 2009-11-30 12:02:28 MST
qsub consistently crashes when a job is submitted with a dependency list
between 62 and 75 jobs long.  The trigger could be the job count or the length
of the dependency string; I'm not sure which.  All the job IDs are 5 digits
long, so in characters the list is 372 to 450 long (5 digits per job ID plus
one for each colon), not counting "afterany".  Lists of fewer than 62 jobs work
fine, and lists of more than 75 jobs produce an "illegal -W value" error from
qsub.  The Torque version is 2.3.6, the OS is CentOS 5.3, and Torque was
installed as part of the Maui/Torque Roll made by the University of Tromso for
Rocks 5.2.
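
The character counts above can be reproduced with a short shell sketch.  The
function name dep_list_len is hypothetical, introduced here only for
illustration; it assumes every job ID has the same number of digits and is
preceded by exactly one colon, as in this report.

```shell
# Length of a colon-separated job ID list, not counting "afterany".
# dep_list_len is a hypothetical helper, not part of Torque.
dep_list_len() {
    # $1 = number of job IDs, $2 = digits per job ID
    echo $(( $1 * ($2 + 1) ))
}

dep_list_len 62 5   # shortest list that crashed in this report
dep_list_len 75 5   # longest list that crashed in this report
```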

This series of bash commands will reproduce the error if the for loop is
modified to create between 62 and 75 initial jobs and those job IDs are five
digits long.

JIDS=""; for (( i=1;i<=62;i++ )); do JID=`echo sleep 15 | qsub`; JIDS="$JIDS:${JID/.*/}"; done
echo $JIDS; JID=`echo date | qsub -W depend=afterany$JIDS`; echo ${JID/.*/}
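
As a rough illustration of how the commands above assemble the -W value, the
sketch below builds the same string from a list of job IDs without calling
qsub.  build_depend is a hypothetical helper, and the sample IDs assume qsub
prints IDs of the form shown in the gdb output in comment 2.  For IDs of this
form, ${id%%.*} is a POSIX equivalent of the bash-only ${JID/.*/}: both drop
everything from the first dot onward.

```shell
# build_depend is a hypothetical helper, not part of Torque.
# $1 = dependency type (e.g. afterany); remaining args = job IDs as printed by qsub.
build_depend() {
    dep=$1
    shift
    for id in "$@"; do
        # Keep only the numeric part before the first dot.
        dep="$dep:${id%%.*}"
    done
    echo "depend=$dep"
}

build_depend afterany 16501.bloom.ocean.washington.edu 16502.bloom.ocean.washington.edu
# prints: depend=afterany:16501:16502
```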

Below is the error message in question.

*** glibc detected *** qsub: double free or corruption (out): 0x0000000016e199f0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x323d871ce2]
/lib64/libc.so.6(cfree+0x8c)[0x323d87590c]
/opt/torque/lib64/libtorque.so.2(parse_depend_list+0x101)[0x2b505d7414a1]
qsub[0x403b90]
qsub[0x40684f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x323d81d974]
qsub[0x402429]
======= Memory map: ========
00400000-0040a000 r-xp 00000000 08:01 7562691                            /opt/torque/bin/qsub
00609000-0060b000 rw-p 00009000 08:01 7562691                            /opt/torque/bin/qsub
0060b000-0060c000 rw-p 0060b000 00:00 0
16e19000-16e3a000 rw-p 16e19000 00:00 0                                  [heap]
323d400000-323d41c000 r-xp 00000000 08:01 6704622                        /lib64/ld-2.5.so
323d61b000-323d61c000 r--p 0001b000 08:01 6704622                        /lib64/ld-2.5.so
323d61c000-323d61d000 rw-p 0001c000 08:01 6704622                        /lib64/ld-2.5.so
323d800000-323d94c000 r-xp 00000000 08:01 6704623                        /lib64/libc-2.5.so
323d94c000-323db4c000 ---p 0014c000 08:01 6704623                        /lib64/libc-2.5.so
323db4c000-323db50000 r--p 0014c000 08:01 6704623                        /lib64/libc-2.5.so
323db50000-323db51000 rw-p 00150000 08:01 6704623                        /lib64/libc-2.5.so
323db51000-323db56000 rw-p 323db51000 00:00 0
324fe00000-324fe0d000 r-xp 00000000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
324fe0d000-325000d000 ---p 0000d000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
325000d000-325000e000 rw-p 0000d000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
2b505d708000-2b505d709000 rw-p 2b505d708000 00:00 0
2b505d725000-2b505d726000 rw-p 2b505d725000 00:00 0
2b505d726000-2b505d74f000 r-xp 00000000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2b505d74f000-2b505d94f000 ---p 00029000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2b505d94f000-2b505d951000 rw-p 00029000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2b505d951000-2b505d97f000 rw-p 2b505d951000 00:00 0
2b505d97f000-2b505d989000 r-xp 00000000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b505d989000-2b505db88000 ---p 0000a000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b505db88000-2b505db89000 r--p 00009000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b505db89000-2b505db8a000 rw-p 0000a000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b5060000000-2b5060021000 rw-p 2b5060000000 00:00 0
2b5060021000-2b5064000000 ---p 2b5060021000 00:00 0
7fff4aa43000-7fff4aa65000 rw-p 7ffffffdd000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]
Comment 1 Glen 2009-11-30 20:39:35 MST
Can you provide a backtrace with gdb?
Comment 2 Chris Berthiaume 2009-12-01 11:42:58 MST
(In reply to comment #1)
> can you provide a backtrace with gdb?

Yes.  Here is the gdb session output with the backtrace at the end.  This run
used 66 five-digit job dependencies.

gdb qsub
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
(gdb) run qsub -W depend=afterany:16501:16502:16503:16504:16505:16506:16507:16508:16509:16510:16511:16512:16513:16514:16515:16516:16517:16518:16519:16520:16521:16522:16523:16524:16525:16526:16527:16528:16529:16530:16531:16532:16533:16534:16535:16536:16537:16538:16539:16540:16541:16542:16543:16544:16545:16546:16547:16548:16549:16550:16551:16552:16553:16554:16555:16556:16557:16558:16559:16560:16561:16562:16563:16564:16565:16566 date.sh
Starting program: /opt/torque/bin/qsub qsub -W depend=afterany:16501:16502:16503:16504:16505:16506:16507:16508:16509:16510:16511:16512:16513:16514:16515:16516:16517:16518:16519:16520:16521:16522:16523:16524:16525:16526:16527:16528:16529:16530:16531:16532:16533:16534:16535:16536:16537:16538:16539:16540:16541:16542:16543:16544:16545:16546:16547:16548:16549:16550:16551:16552:16553:16554:16555:16556:16557:16558:16559:16560:16561:16562:16563:16564:16565:16566 date.sh
(no debugging symbols found)
*** glibc detected *** /opt/torque/bin/qsub: munmap_chunk(): invalid pointer: 0x000000001d4869d0 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x1b6)[0x323d875a36]
/opt/torque/lib64/libtorque.so.2(parse_depend_list+0x101)[0x2ae34ced84a1]
/opt/torque/bin/qsub[0x403b90]
/opt/torque/bin/qsub[0x40684f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x323d81d974]
/opt/torque/bin/qsub[0x402429]
======= Memory map: ========
00400000-0040a000 r-xp 00000000 08:01 7562691                            /opt/torque/bin/qsub
00609000-0060b000 rw-p 00009000 08:01 7562691                            /opt/torque/bin/qsub
0060b000-0060c000 rw-p 0060b000 00:00 0
1d486000-1d4a7000 rw-p 1d486000 00:00 0                                  [heap]
323d400000-323d41c000 r-xp 00000000 08:01 6704622                        /lib64/ld-2.5.so
323d61b000-323d61c000 r--p 0001b000 08:01 6704622                        /lib64/ld-2.5.so
323d61c000-323d61d000 rw-p 0001c000 08:01 6704622                        /lib64/ld-2.5.so
323d800000-323d94c000 r-xp 00000000 08:01 6704623                        /lib64/libc-2.5.so
323d94c000-323db4c000 ---p 0014c000 08:01 6704623                        /lib64/libc-2.5.so
323db4c000-323db50000 r--p 0014c000 08:01 6704623                        /lib64/libc-2.5.so
323db50000-323db51000 rw-p 00150000 08:01 6704623                        /lib64/libc-2.5.so
323db51000-323db56000 rw-p 323db51000 00:00 0
324fe00000-324fe0d000 r-xp 00000000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
324fe0d000-325000d000 ---p 0000d000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
325000d000-325000e000 rw-p 0000d000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
2ae34ce9f000-2ae34cea0000 rw-p 2ae34ce9f000 00:00 0
2ae34cebc000-2ae34cebd000 rw-p 2ae34cebc000 00:00 0
2ae34cebd000-2ae34cee6000 r-xp 00000000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2ae34cee6000-2ae34d0e6000 ---p 00029000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2ae34d0e6000-2ae34d0e8000 rw-p 00029000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2ae34d0e8000-2ae34d116000 rw-p 2ae34d0e8000 00:00 0
2ae34d116000-2ae34d120000 r-xp 00000000 08:01 6704347                    /lib64/libnss_files-2.5.so
2ae34d120000-2ae34d31f000 ---p 0000a000 08:01 6704347                    /lib64/libnss_files-2.5.so
2ae34d31f000-2ae34d320000 r--p 00009000 08:01 6704347                    /lib64/libnss_files-2.5.so
2ae34d320000-2ae34d321000 rw-p 0000a000 08:01 6704347                    /lib64/libnss_files-2.5.so
7fffd613e000-7fffd615f000 rw-p 7ffffffde000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]

Program received signal SIGABRT, Aborted.
0x000000323d830215 in raise () from /lib64/libc.so.6
(gdb) backtrace
#0  0x000000323d830215 in raise () from /lib64/libc.so.6
#1  0x000000323d831cc0 in abort () from /lib64/libc.so.6
#2  0x000000323d86a7fb in __libc_message () from /lib64/libc.so.6
#3  0x000000323d875a36 in free () from /lib64/libc.so.6
#4  0x00002ae34ced84a1 in parse_depend_list (list=<value optimized out>,
    rtn_list=0x1d4861d0
"afterany:16501.bloom.ocean.washington.edu:16502.bloom.ocean.washington.edu:16503.bloom.ocean.washington.edu:16504.bloom.ocean.washington.edu:16505.bloom.ocean.washington.edu:16506.bloom.ocean.washingt"...,
    rtn_size=2040) at ../Libcmds/parse_depend.c:270
#5  0x0000000000403b90 in process_opts ()
#6  0x000000000040684f in main ()
(gdb)
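
Frame #4 above shows parse_depend_list working on job IDs expanded to their
full server names, with rtn_size=2040.  As a rough check, the sketch below
estimates how long the expanded string gets for a given number of jobs.  It
assumes every expanded ID has the same length as the first one printed in the
backtrace and that each ID is preceded by one colon; expanded_len is a
hypothetical helper.  This is only arithmetic on the values printed above, not
a confirmed diagnosis of parse_depend.c.

```shell
# Values taken from frame #4 of the backtrace in this comment.
RTN_SIZE=2040
FULL_ID="16501.bloom.ocean.washington.edu"
PREFIX="afterany"

# expanded_len is a hypothetical helper, not part of Torque.
# $1 = number of job IDs; prints the estimated length of the expanded string.
expanded_len() {
    echo $(( ${#PREFIX} + $1 * (${#FULL_ID} + 1) ))
}

for n in 61 62 66; do
    len=$(expanded_len "$n")
    if [ "$len" -gt "$RTN_SIZE" ]; then
        verdict="exceeds"
    else
        verdict="fits in"
    fi
    echo "$n jobs: $len bytes, $verdict rtn_size=$RTN_SIZE"
done
# prints:
# 61 jobs: 2021 bytes, fits in rtn_size=2040
# 62 jobs: 2054 bytes, exceeds rtn_size=2040
# 66 jobs: 2186 bytes, exceeds rtn_size=2040
```

Under these assumptions the expanded string first outgrows rtn_size at 62
jobs, which lines up with the lower threshold given in the description.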
Comment 3 Al Taufer 2009-12-08 16:22:12 MST
Changes to correct this were checked into the 2.3, 2.4 and 2.5 branches and
will be available in future snapshots and releases.