[torqueusers] qsub: double free or corruption

Öttl Peter poettl at cscs.ch
Wed Oct 20 08:01:46 MDT 2010


Hi,

We had a user report that his jobs are failing with "qsub: double free or corruption".

The submit script (generated by gLite CREAM-CE middleware) looks as follows:

Oct 20 15:11 [honeprd@cream02:~]$ cat /tmp/cre02_882222885
#############################################################
#!/bin/bash
# PBS job wrapper generated by pbs_submit.sh
# on Wed Oct 20 10:51:19 CEST 2010
#
# stgcmd = yes
# proxy_string = /opt/glite/var/cream_sandbox/hone/_O_GermanGrid_OU_DESY_CN_Alexander_Fomenko_hone_Role_production_Capability_NULL_honeprd/proxy/12854315302E200928grid2Dwms132Edesy2Ede12365430362862
# proxy_local_file = /opt/glite/var/cream_sandbox/hone/_O_GermanGrid_OU_DESY_CN_Alexander_Fomenko_hone_Role_production_Capability_NULL_honeprd/proxy/12854315302E200928grid2Dwms132Edesy2Ede12365430362862
#
# PBS directives:
#PBS -S /bin/bash
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -q other
#PBS -W stagein=CREAM882222885_jobWrapper.sh@cream02.lcg.cscs.ch:/opt/glite/var/cream_sandbox/hone/_O_GermanGrid_OU_DESY_CN_Alexander_Fomenko_hone_Role_production_Capability_NULL_honeprd/88/CREAM882222885/CREAM882222885_jobWrapper.sh,stagein=cre02_882222885.proxy@cream02.lcg.cscs.ch:/opt/glite/var/cream_sandbox/hone/_O_GermanGrid_OU_DESY_CN_Alexander_Fomenko_hone_Role_production_Capability_NULL_honeprd/proxy/12854315302E200928grid2Dwms132Edesy2Ede12365430362862
#PBS -W stageout=out_cre02_882222885_StandardOutput@cream02.lcg.cscs.ch:/opt/glite/var/cream_sandbox/hone/_O_GermanGrid_OU_DESY_CN_Alexander_Fomenko_hone_Role_production_Capability_NULL_honeprd/88/CREAM882222885/StandardOutput,stageout=err_cre02_882222885_StandardOutput@cream02.lcg.cscs.ch:/opt/glite/var/cream_sandbox/hone/_O_GermanGrid_OU_DESY_CN_Alexander_Fomenko_hone_Role_production_Capability_NULL_honeprd/88/CREAM882222885/StandardError
#PBS -m n
new_home=`pwd`/home_cre02_882222885
mkdir $new_home
mv  CREAM882222885_jobWrapper.sh cre02_882222885.proxy $new_home &>/dev/null
export HOME=$new_home
cd $new_home
# Resetting proxy to local position
export X509_USER_PROXY=$new_home/cre02_882222885.proxy

# Command to execute:
if [ ! -x ./CREAM882222885_jobWrapper.sh ]; then chmod u+x ./CREAM882222885_jobWrapper.sh; fi
if [ -x ${GLITE_LOCATION:-/opt/glite}/libexec/jobwrapper ]
then
${GLITE_LOCATION:-/opt/glite}/libexec/jobwrapper ./CREAM882222885_jobWrapper.sh UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LRMS=000000:APP=000000:LBS=000000 > ../out_cre02_882222885_StandardOutput 2> ../err_cre02_882222885_StandardOutput &
elif [ -x /opt/lcg/libexec/jobwrapper ]
then
/opt/lcg/libexec/jobwrapper ./CREAM882222885_jobWrapper.sh UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LRMS=000000:APP=000000:LBS=000000 > ../out_cre02_882222885_StandardOutput 2> ../err_cre02_882222885_StandardOutput &
else
$new_home/CREAM882222885_jobWrapper.sh UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LRMS=000000:APP=000000:LBS=000000 > ../out_cre02_882222885_StandardOutput 2> ../err_cre02_882222885_StandardOutput &
fi
job_pid=$!

# Wait for the user job to finish
wait $job_pid
user_retcode=$?

# Remove the staged files
rm  CREAM882222885_jobWrapper.sh cre02_882222885.proxy
cd ..
rm -rf $HOME

exit $user_retcode
#############################################################
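
The two -W directives in this wrapper are far longer than the others. Nothing in this report establishes that line length is related to the crash; as a quick diagnostic only, a sketch like the following prints the length of every #PBS line so that candidates for trimming stand out. The function name pbs_directive_lengths is made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: for each "#PBS " directive line in the script given as
# $1, print its length in characters followed by its first 20 characters.
pbs_directive_lengths() {
    awk '/^#PBS / { printf "%d %s\n", length($0), substr($0, 1, 20) }' "$1"
}

# Example (not executed here):
#   pbs_directive_lengths /tmp/cre02_882222885
```

If the crash disappears once a long directive is shortened or removed, that would narrow down which input qsub mishandles.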



When submitting this job, glibc detects "qsub: double free or corruption":

Oct 20 15:11 [honeprd@cream02:~]$ qsub /tmp/cre02_882222885
*** glibc detected *** qsub: double free or corruption (!prev): 0x0000000007a7acd0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d8f6722ef]
/lib64/libc.so.6(cfree+0x4b)[0x3d8f67273b]
qsub[0x402b22]
qsub[0x402e44]
qsub[0x404309]
qsub[0x4069dc]
qsub[0x406fc7]
qsub[0x407680]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3d8f61d994]
qsub[0x402669]
======= Memory map: ========
00400000-0040b000 r-xp 00000000 ca:01 270184                             /usr/bin/qsub
0060a000-0060c000 rw-p 0000a000 ca:01 270184                             /usr/bin/qsub
0060c000-0060e000 rw-p 0060c000 00:00 0
0080b000-0080c000 rw-p 0000b000 ca:01 270184                             /usr/bin/qsub
07a7a000-07a9b000 rw-p 07a7a000 00:00 0
3b50200000-3b5020d000 r-xp 00000000 ca:01 1474613                        /lib64/libgcc_s-4.1.2-20080825.so.1
3b5020d000-3b5040d000 ---p 0000d000 ca:01 1474613                        /lib64/libgcc_s-4.1.2-20080825.so.1
3b5040d000-3b5040e000 rw-p 0000d000 ca:01 1474613                        /lib64/libgcc_s-4.1.2-20080825.so.1
3d8f200000-3d8f21c000 r-xp 00000000 ca:01 1474879                        /lib64/ld-2.5.so
3d8f41b000-3d8f41c000 r--p 0001b000 ca:01 1474879                        /lib64/ld-2.5.so
3d8f41c000-3d8f41d000 rw-p 0001c000 ca:01 1474879                        /lib64/ld-2.5.so
3d8f600000-3d8f74d000 r-xp 00000000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f74d000-3d8f94d000 ---p 0014d000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f94d000-3d8f951000 r--p 0014d000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f951000-3d8f952000 rw-p 00151000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f952000-3d8f957000 rw-p 3d8f952000 00:00 0
3d8fe00000-3d8fe16000 r-xp 00000000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d8fe16000-3d90015000 ---p 00016000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d90015000-3d90016000 r--p 00015000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d90016000-3d90017000 rw-p 00016000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d90017000-3d9001b000 rw-p 3d90017000 00:00 0
3eaf000000-3eaf02a000 r-xp 00000000 ca:01 270488                         /usr/lib/libtorque.so.2.0.0
3eaf02a000-3eaf22a000 ---p 0002a000 ca:01 270488                         /usr/lib/libtorque.so.2.0.0
3eaf22a000-3eaf22c000 rw-p 0002a000 ca:01 270488                         /usr/lib/libtorque.so.2.0.0
3eaf22c000-3eaf30f000 rw-p 3eaf22c000 00:00 0
2b13fa619000-2b13fa61d000 rw-p 2b13fa619000 00:00 0
2b13fa638000-2b13fa63a000 rw-p 2b13fa638000 00:00 0
7fff6d832000-7fff6d85a000 rw-p 7ffffffd6000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]



If I set MALLOC_CHECK_ to 0, so that glibc silently ignores any heap corruption, the job is submitted successfully:

Oct 20 15:11 [honeprd@cream02:~]$ export MALLOC_CHECK_=0
Oct 20 15:14 [honeprd@cream02:~]$ qsub /tmp/cre02_882222885
1731985.lrms02.lcg.cscs.ch
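
Exporting MALLOC_CHECK_ changes the behaviour of every later command started from that shell. As a narrower variant of the same workaround, the variable can be set for a single command only. This is a minimal sketch; the helper name with_malloc_check is made up for this illustration, and it only hides the symptom rather than fixing the underlying bug.

```shell
#!/bin/bash
# Hypothetical helper: run one command with MALLOC_CHECK_ set to the given
# level in that command's environment only. The calling shell is not changed.
with_malloc_check() {
    local level="$1"
    shift
    MALLOC_CHECK_="$level" "$@"
}

# Example (not executed here):
#   with_malloc_check 0 qsub /tmp/cre02_882222885
```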



For debugging, I set MALLOC_CHECK_ to 3 so that glibc aborts the command as soon as it detects a problem. In this case glibc reports "qsub: free(): invalid pointer":

Oct 20 15:14 [honeprd@cream02:~]$ export MALLOC_CHECK_=3
Oct 20 15:22 [honeprd@cream02:~]$ qsub /tmp/cre02_882222885
malloc: using debugging hooks
*** glibc detected *** qsub: free(): invalid pointer: 0x00000000095b7b70 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d8f675d81]
/lib64/libc.so.6(cfree+0xd1)[0x3d8f6727c1]
qsub[0x402b22]
qsub[0x402e44]
qsub[0x404309]
qsub[0x4069dc]
qsub[0x406fc7]
qsub[0x407680]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3d8f61d994]
qsub[0x402669]
======= Memory map: ========
00400000-0040b000 r-xp 00000000 ca:01 270184                             /usr/bin/qsub
0060a000-0060c000 rw-p 0000a000 ca:01 270184                             /usr/bin/qsub
0060c000-0060e000 rw-p 0060c000 00:00 0
0080b000-0080c000 rw-p 0000b000 ca:01 270184                             /usr/bin/qsub
095b7000-095d8000 rw-p 095b7000 00:00 0
3b50200000-3b5020d000 r-xp 00000000 ca:01 1474613                        /lib64/libgcc_s-4.1.2-20080825.so.1
3b5020d000-3b5040d000 ---p 0000d000 ca:01 1474613                        /lib64/libgcc_s-4.1.2-20080825.so.1
3b5040d000-3b5040e000 rw-p 0000d000 ca:01 1474613                        /lib64/libgcc_s-4.1.2-20080825.so.1
3d8f200000-3d8f21c000 r-xp 00000000 ca:01 1474879                        /lib64/ld-2.5.so
3d8f41b000-3d8f41c000 r--p 0001b000 ca:01 1474879                        /lib64/ld-2.5.so
3d8f41c000-3d8f41d000 rw-p 0001c000 ca:01 1474879                        /lib64/ld-2.5.so
3d8f600000-3d8f74d000 r-xp 00000000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f74d000-3d8f94d000 ---p 0014d000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f94d000-3d8f951000 r--p 0014d000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f951000-3d8f952000 rw-p 00151000 ca:01 1474880                        /lib64/libc-2.5.so
3d8f952000-3d8f957000 rw-p 3d8f952000 00:00 0
3d8fe00000-3d8fe16000 r-xp 00000000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d8fe16000-3d90015000 ---p 00016000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d90015000-3d90016000 r--p 00015000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d90016000-3d90017000 rw-p 00016000 ca:01 1474881                        /lib64/libpthread-2.5.so
3d90017000-3d9001b000 rw-p 3d90017000 00:00 0
3eaf000000-3eaf02a000 r-xp 00000000 ca:01 270488                         /usr/lib/libtorque.so.2.0.0
3eaf02a000-3eaf22a000 ---p 0002a000 ca:01 270488                         /usr/lib/libtorque.so.2.0.0
3eaf22a000-3eaf22c000 rw-p 0002a000 ca:01 270488                         /usr/lib/libtorque.so.2.0.0
3eaf22c000-3eaf30f000 rw-p 3eaf22c000 00:00 0
2b395e767000-2b395e76b000 rw-p 2b395e767000 00:00 0
2b395e786000-2b395e788000 rw-p 2b395e786000 00:00 0
7fff8bf07000-7fff8bf30000 rw-p 7ffffffd5000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]
Aborted
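
The qsub frames in both backtraces are bare addresses without symbol names. Since the memory map shows /usr/bin/qsub mapped at its link address 0x00400000, those addresses could in principle be passed straight to addr2line. The sketch below extracts them from a saved copy of the backtrace; the file name backtrace.txt and the function name qsub_frames are assumptions for this illustration, and addr2line only prints useful names if symbol or debug information for qsub is available.

```shell
#!/bin/bash
# Hypothetical helper: read a glibc backtrace on stdin and print the address
# of every "qsub[0x...]" frame, one per line, in the order they appear.
qsub_frames() {
    sed -n 's/^qsub\[\(0x[0-9a-fA-F]*\)]$/\1/p'
}

# Example (not executed here; requires binutils):
#   qsub_frames < backtrace.txt | xargs addr2line -f -e /usr/bin/qsub
```

Without debug information addr2line prints "??" for each frame, in which case the raw addresses are still useful to anyone who can match them against a build of the same qsub binary.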



We are running Torque 2.4.11 on SL5.4:

Oct 20 15:47 [root@cream02:~]# lsb_release -a
LSB Version: :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: ScientificSL
Description: Scientific Linux SL release 5.4 (Boron)
Release: 5.4
Codename: Boron

Oct 20 15:47 [root@cream02:~]# rpm -qa|grep torque
torque-client-2.4.11-1cri.x86_64
torque-2.4.11-1cri.x86_64



Has anybody seen this problem before?

Cheers,

  Peter

--
Ing. Peter Oettl | CSCS Swiss National Supercomputing Centre
Systems Engineer | HPC Co-Location Services
Via Cantonale, Galleria 2 | CH-6928 Manno
peter.oettl at cscs.ch | www.cscs.ch | Phone +41 91 610 82 34


