[torqueusers] qsub crashing with dependency list between 62 and 75 jobs long

Chris Berthiaume chrisbee at u.washington.edu
Wed Nov 25 13:17:24 MST 2009


Hello,

qsub consistently crashes when I submit a job whose dependency list is between 62 and 75 jobs long.  I'm not sure whether the trigger is the number of jobs or the length of the dependency string.  All the job IDs are 5 digits long, so the crashing range corresponds to 372 to 450 characters in the list (5 digits per job ID plus one colon each), not counting "afterany".  Lists of fewer than 62 jobs work fine, and lists of more than 75 jobs produce an "illegal -W value" error from qsub.  The error I'm seeing is:

*** glibc detected *** qsub: double free or corruption (out): 0x0000000016e199f0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x323d871ce2]
/lib64/libc.so.6(cfree+0x8c)[0x323d87590c]
/opt/torque/lib64/libtorque.so.2(parse_depend_list+0x101)[0x2b505d7414a1]
qsub[0x403b90]
qsub[0x40684f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x323d81d974]
qsub[0x402429]
======= Memory map: ========
00400000-0040a000 r-xp 00000000 08:01 7562691                            /opt/torque/bin/qsub
00609000-0060b000 rw-p 00009000 08:01 7562691                            /opt/torque/bin/qsub
0060b000-0060c000 rw-p 0060b000 00:00 0 
16e19000-16e3a000 rw-p 16e19000 00:00 0                                  [heap]
323d400000-323d41c000 r-xp 00000000 08:01 6704622                        /lib64/ld-2.5.so
323d61b000-323d61c000 r--p 0001b000 08:01 6704622                        /lib64/ld-2.5.so
323d61c000-323d61d000 rw-p 0001c000 08:01 6704622                        /lib64/ld-2.5.so
323d800000-323d94c000 r-xp 00000000 08:01 6704623                        /lib64/libc-2.5.so
323d94c000-323db4c000 ---p 0014c000 08:01 6704623                        /lib64/libc-2.5.so
323db4c000-323db50000 r--p 0014c000 08:01 6704623                        /lib64/libc-2.5.so
323db50000-323db51000 rw-p 00150000 08:01 6704623                        /lib64/libc-2.5.so
323db51000-323db56000 rw-p 323db51000 00:00 0 
324fe00000-324fe0d000 r-xp 00000000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
324fe0d000-325000d000 ---p 0000d000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
325000d000-325000e000 rw-p 0000d000 08:01 6704530                        /lib64/libgcc_s-4.1.2-20080825.so.1
2b505d708000-2b505d709000 rw-p 2b505d708000 00:00 0 
2b505d725000-2b505d726000 rw-p 2b505d725000 00:00 0 
2b505d726000-2b505d74f000 r-xp 00000000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2b505d74f000-2b505d94f000 ---p 00029000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2b505d94f000-2b505d951000 rw-p 00029000 08:01 7562709                    /opt/torque/lib64/libtorque.so.2.0.0
2b505d951000-2b505d97f000 rw-p 2b505d951000 00:00 0 
2b505d97f000-2b505d989000 r-xp 00000000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b505d989000-2b505db88000 ---p 0000a000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b505db88000-2b505db89000 r--p 00009000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b505db89000-2b505db8a000 rw-p 0000a000 08:01 6704347                    /lib64/libnss_files-2.5.so
2b5060000000-2b5060021000 rw-p 2b5060000000 00:00 0 
2b5060021000-2b5064000000 ---p 2b5060021000 00:00 0 
7fff4aa43000-7fff4aa65000 rw-p 7ffffffdd000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]

This bash command reproduces the error as long as the for loop creates between 62 and 75 initial jobs (it creates 62 as written).

JIDS=""; for (( i=1; i<=62; i++ )); do JID=$(echo sleep 15 | qsub); JIDS="$JIDS:${JID%%.*}"; done; echo "$JIDS"; JID=$(echo date | qsub -W depend=afterany"$JIDS"); echo "${JID%%.*}"
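
To make the character counts quoted earlier easy to check, here is a small bash sketch that builds a dependency string of the same shape without submitting anything.  The function name dep_string and the job IDs (10001, 10002, ...) are made up for illustration; the only assumption carried over from the report is that every job ID is 5 digits long.

```shell
# Build "afterany:ID:ID:..." from N placeholder 5-digit job IDs.
# The IDs are invented (10001, 10002, ...); no jobs are submitted.
dep_string() {
    local n=$1 ids="" i
    for (( i = 1; i <= n; i++ )); do
        ids="$ids:$(( 10000 + i ))"
    done
    printf 'afterany%s' "$ids"
}

# Report sizes just below, at, and just above the crashing range.
for n in 61 62 75 76; do
    s=$(dep_string "$n")
    # "afterany" is 8 characters; each ":ID" entry is 6 characters.
    echo "$n jobs: list=$(( ${#s} - 8 )) chars, with afterany=${#s} chars"
done
```

With 5-digit IDs this gives 372 and 450 characters for 62 and 75 jobs, or 380 and 458 once "afterany" is included.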

I'm using Torque version 2.3.6 on CentOS 5.3 x86_64.  This is a Rocks cluster (5.2.2) using the Torque/Maui Roll.  Any idea what's causing this?  Is there a documented limit on the size of the dependency string or on the job count?  I can imagine scenarios where one job might depend on over a hundred other jobs.  Is this supported in Torque, or is the right answer to use job arrays or more hierarchical job dependency lists?
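
For the hierarchical option, the sketch below shows one way the splitting might look: it only prints one dependency value per group of job IDs and submits nothing.  The function name chunk_deps, the group size, and the sample IDs are hypothetical, and whether intermediate jobs are an acceptable workaround is exactly the open question above.

```shell
# Print one "depend=afterany:..." value per group of at most MAX job IDs.
# Usage: chunk_deps MAX ID [ID...]
chunk_deps() {
    local max=$1 count=0 dep="" id
    shift
    for id in "$@"; do
        dep="$dep:$id"
        count=$(( count + 1 ))
        if (( count == max )); then
            echo "depend=afterany$dep"
            dep=""
            count=0
        fi
    done
    # Emit the final, partially filled group if there is one.
    if [ -n "$dep" ]; then
        echo "depend=afterany$dep"
    fi
}

# Demo with invented IDs and a small group size.
chunk_deps 3 10001 10002 10003 10004 10005 10006 10007

# Untested idea for real use: submit one trivial job per printed value with
# "qsub -W <value>", then make the final job depend on those intermediate jobs.
```

Each printed value stays below the list length that crashes qsub, provided the group size is kept under 62.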

Thanks,
Chris

--
Chris Berthiaume
Center for Environmental Genomics
University of Washington

