[TORQUEUsers] TORQUE & Valgrind - memory handling errors ?

Chris Samuel csamuel@vpac.org
Thu, 31 Jul 2003 12:09:54 +1000


--Boundary-00=_ypHK/fXzTGn6Op9
Content-Type: Text/Plain;
  charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-Description: clearsigned data
Content-Disposition: inline

=2D----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

After my previous message about TORQUE SEGV'ing I've managed to free up a cou=
ple=20
of nodes in our cluster here to create a sandpit to do testing work on TORQUE.

Setup - 3 nodes, each with 2 processors, one as the server and 2 compute.

My first test was to build a completely vanilla TORQUE install (no additional=
=20
patches or artificial flavourings) and then run a fairly trivial test (star=
t=20
up pbs_mom on the two compute nodes, do a pbsnodes -l, mark one as off-line=
=20
by pbsnodes -o <node>, do another pbsnodes -l and then shut the server down=
=20
with a qterm -t quick).

TORQUE was configured with (original on one line):

	./configure  --with-scp '--set-cflags=3D-g -Wall' \
	--set-server-home=3D/usr/spool/TORQUE --prefix=3D/usr/local/TORQUE-test

During that test the server was running under valgrind, a nice little memor=
y=20
debugger from:

	http://developer.kde.org/~sewardj/

I started the server with the environment variable PBSDEBUG=3Dtrue to stop =
it=20
forking off and also gives some additional debugging info.

Scarily it picked up an error immediately the server start, to quote:

=3D=3D24663=3D=3D Conditional jump or move depends on uninitialised value(s)
=3D=3D24663=3D=3D    at 0x804FF5C: check_nodes (node_manager.c:663)
=3D=3D24663=3D=3D    by 0x806A1A4: dispatch_task (svr_task.c:162)
=3D=3D24663=3D=3D    by 0x8055346: next_task (pbsd_main.c:721)
=3D=3D24663=3D=3D    by 0x8054EB0: main (pbsd_main.c:565)
tinsert: 2886992728 node088 stream -1
tinsert: 2886992729 node089 stream -1

There was another error created when the first pbs_mom was started on the=20
first of the sandpit compute nodes (node088):

do_rpp: got an inter-server request
is_request: stream 0 version 1
is_request: connect from 172.20.3.88:1023
tinsert: 0 node088 stream 0
is_request: IS_STATUS
=3D=3D24663=3D=3D
=3D=3D24663=3D=3D Conditional jump or move depends on uninitialised value(s)
=3D=3D24663=3D=3D    at 0x804F8D8: is_stat_get (node_manager.c:366)
=3D=3D24663=3D=3D    by 0x80504D2: is_request (node_manager.c:853)
=3D=3D24663=3D=3D    by 0x80542DA: do_rpp (pbsd_main.c:228)
=3D=3D24663=3D=3D    by 0x805435A: rpp_request (pbsd_main.c:253)
is_request: node088(0)
is_request: 0   172.20.3.88
is_request: node089(1)
is_request: 0   172.20.3.89

Interestingly enough this didn't occur when the second pbs_mom connects=20
(node089):

do_rpp: got an inter-server request
is_request: stream 1 version 1
is_request: connect from 172.20.3.89:1023
tinsert: 1 node089 stream 1
is_request: IS_STATUS
is_request: node088(0)
is_request: 0   172.20.3.88
is_request: node089(1)
is_request: 0   172.20.3.89

Now, when I tried to do "pbsnodes -o node088" things really became noisy. H=
ere=20
we see two invalid reads()'s followed by two invalid free()'s:

=3D=3D24663=3D=3D
=3D=3D24663=3D=3D Invalid read of size 4
=3D=3D24663=3D=3D    at 0x805A5E7: mgr_set_node_attr (req_manager.c:656)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
=3D=3D24663=3D=3D    by 0x805BD76: req_manager (req_manager.c:1437)
=3D=3D24663=3D=3D    by 0x80559F6: dispatch_request (process_request.c:398)
=3D=3D24663=3D=3D    Address 0x4123811C is 12 bytes inside a block of size =
68 free'd
=3D=3D24663=3D=3D    at 0x40026B6F: free (vg_replace_malloc.c:220)
=3D=3D24663=3D=3D    by 0x806FE06: node_status_list (attr_node_func.c:822)
=3D=3D24663=3D=3D    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
=3D=3D24663=3D=3D
=3D=3D24663=3D=3D Invalid read of size 4
=3D=3D24663=3D=3D    at 0x805A5F6: mgr_set_node_attr (req_manager.c:657)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
=3D=3D24663=3D=3D    by 0x805BD76: req_manager (req_manager.c:1437)
=3D=3D24663=3D=3D    by 0x80559F6: dispatch_request (process_request.c:398)
=3D=3D24663=3D=3D    Address 0x4123811C is 12 bytes inside a block of size =
68 free'd
=3D=3D24663=3D=3D    at 0x40026B6F: free (vg_replace_malloc.c:220)
=3D=3D24663=3D=3D    by 0x806FE06: node_status_list (attr_node_func.c:822)
=3D=3D24663=3D=3D    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
=3D=3D24663=3D=3D
=3D=3D24663=3D=3D Invalid free() / delete / delete[]
=3D=3D24663=3D=3D    at 0x40026B6F: free (vg_replace_mall=3D=3D24663=3D=3D
=3D=3D24663=3D=3D Syscall param write(buf) contains uninitialised or unaddr=
essable=20
byte(
s)
=3D=3D24663=3D=3D    at 0x4030E754: __libc_write (in /lib/libc-2.2.5.so)
=3D=3D24663=3D=3D    by 0x804AA06: save_flush (attr_recov.c:220)
=3D=3D24663=3D=3D    by 0x806925C: svr_save (svr_recov.c:233)
=3D=3D24663=3D=3D    by 0x80550AE: main (pbsd_main.c:630)
=3D=3D24663=3D=3D    Address 0x8085CC2 is not stack'd, malloc'd or free'd
=3D=3D24663=3D=3D
=3D=3D24663=3D=3D ERROR SUMMARY: 10 errors from 7 contexts (suppressed: 9 f=
rom 1)
=3D=3D24663=3D=3D malloc/free: in use at exit: 166696 bytes in 107 blocks.
=3D=3D24663=3D=3D malloc/free: 514 allocs, 409 frees, 823548 bytes allocate=
d.
=3D=3D24663=3D=3D For a detailed leak analysis,  rerun with: --leak-check=
=3Dyes
=3D=3D24663=3D=3D For counts of detected errors, rerun with: -v
oc.c:220)
=3D=3D24663=3D=3D    by 0x805A5FD: mgr_set_node_attr (req_manager.c:657)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
=3D=3D24663=3D=3D    by 0x805BD76: req_manager (req_manager.c:1437)
=3D=3D24663=3D=3D    Address 0x41237FB0 is 0 bytes inside a block of size 3=
02 free'd
=3D=3D24663=3D=3D    at 0x40026B6F: free (vg_replace_malloc.c:220)
=3D=3D24663=3D=3D    by 0x806FDF5: node_status_list (attr_node_func.c:821)
=3D=3D24663=3D=3D    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
=3D=3D24663=3D=3D
=3D=3D24663=3D=3D Invalid free() / delete / delete[]
=3D=3D24663=3D=3D    at 0x40026B6F: free (vg_replace_malloc.c:220)
=3D=3D24663=3D=3D    by 0x805A60E: mgr_set_node_attr (req_manager.c:658)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
=3D=3D24663=3D=3D    by 0x805BD76: req_manager (req_manager.c:1437)
=3D=3D24663=3D=3D    Address 0x41238110 is 0 bytes inside a block of size 6=
8 free'd
=3D=3D24663=3D=3D    at 0x40026B6F: free (vg_replace_malloc.c:220)
=3D=3D24663=3D=3D    by 0x806FE06: node_status_list (attr_node_func.c:822)
=3D=3D24663=3D=3D    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
=3D=3D24663=3D=3D    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
write_node_state: entered
update_nodes_file: entered

Then one of the mom's talks to us with no problems:

do_rpp: got an inter-server request
is_request: stream 0 version 1
is_request: connect from 172.20.3.88:1023
is_request: IS_STATUS

and then we do a "qterm -t quick" and we get our final error message:

=3D=3D24663=3D=3D
=3D=3D24663=3D=3D Syscall param write(buf) contains uninitialised or unaddr=
essable=20
byte(s)
=3D=3D24663=3D=3D    at 0x4030E754: __libc_write (in /lib/libc-2.2.5.so)
=3D=3D24663=3D=3D    by 0x804AA06: save_flush (attr_recov.c:220)
=3D=3D24663=3D=3D    by 0x806925C: svr_save (svr_recov.c:233)
=3D=3D24663=3D=3D    by 0x80550AE: main (pbsd_main.c:630)
=3D=3D24663=3D=3D    Address 0x8085CC2 is not stack'd, malloc'd or free'd

and now for the summary at the end:

=3D=3D24663=3D=3D
=3D=3D24663=3D=3D ERROR SUMMARY: 10 errors from 7 contexts (suppressed: 9 f=
rom 1)
=3D=3D24663=3D=3D malloc/free: in use at exit: 166696 bytes in 107 blocks.
=3D=3D24663=3D=3D malloc/free: 514 allocs, 409 frees, 823548 bytes allocate=
d.
=3D=3D24663=3D=3D For a detailed leak analysis,  rerun with: --leak-check=
=3Dyes
=3D=3D24663=3D=3D For counts of detected errors, rerun with: -v

I've attached the log (server_debug), the script that runs the server with =
the=20
options (server-script) and the script that generates the commands (pbs-tes=
t)=20
to help people try and reproduce it.

cheers!
Chris
=2D --=20
 Chris Samuel -- VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing
 Bldg 91, 110 Victoria Street, Carlton South,
 VIC 3053, Australia - http://www.vpac.org/

=2D----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQE/KHp2O2KABBYQAh8RAnHcAJ0RQR+Xs8Eqmu3aUxTxwxCe4zQovgCeLtis
99KDpDKyRKCdZEUiZewFVZ8=3D
=3DnRsZ
=2D----END PGP SIGNATURE-----

--Boundary-00=_ypHK/fXzTGn6Op9
Content-Type: text/plain;
  charset="us-ascii";
  name="server_debug"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="server_debug"

+ export PATH=/usr/local/TORQUE-test/bin:/usr/local/TORQUE-test/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/opt/bin:/root/bin
+ PATH=/usr/local/TORQUE-test/bin:/usr/local/TORQUE-test/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/opt/bin:/root/bin
+ export PBSDEBUG=true
+ PBSDEBUG=true
+ /usr/local/TORQUE-test/sbin/pbs_sched
+ /home/ua/csamuel/bin/valgrind /usr/local/TORQUE-test/sbin/pbs_server
==24663== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
==24663== Copyright (C) 2002-2003, and GNU GPL'd, by Julian Seward.
==24663== Using valgrind-20030725, a program supervision framework for x86-linux.
==24663== Copyright (C) 2000-2003, and GNU GPL'd, by Julian Seward.
==24663== Estimated CPU clock rate is 2811 MHz
==24663== For more details, rerun with: -v
==24663== 
==24663== Conditional jump or move depends on uninitialised value(s)
==24663==    at 0x804FF5C: check_nodes (node_manager.c:663)
==24663==    by 0x806A1A4: dispatch_task (svr_task.c:162)
==24663==    by 0x8055346: next_task (pbsd_main.c:721)
==24663==    by 0x8054EB0: main (pbsd_main.c:565)
tinsert: 2886992728 node088 stream -1
tinsert: 2886992729 node089 stream -1
do_rpp: got an inter-server request
is_request: stream 0 version 1
is_request: connect from 172.20.3.88:1023
tinsert: 0 node088 stream 0
is_request: IS_STATUS
==24663== 
==24663== Conditional jump or move depends on uninitialised value(s)
==24663==    at 0x804F8D8: is_stat_get (node_manager.c:366)
==24663==    by 0x80504D2: is_request (node_manager.c:853)
==24663==    by 0x80542DA: do_rpp (pbsd_main.c:228)
==24663==    by 0x805435A: rpp_request (pbsd_main.c:253)
is_request: node088(0)
is_request: 0	172.20.3.88
is_request: node089(1)
is_request: 0	172.20.3.89
do_rpp: got an inter-server request
is_request: stream 1 version 1
is_request: connect from 172.20.3.89:1023
tinsert: 1 node089 stream 1
is_request: IS_STATUS
is_request: node088(0)
is_request: 0	172.20.3.88
is_request: node089(1)
is_request: 0	172.20.3.89
==24663== 
==24663== Invalid read of size 4
==24663==    at 0x805A5E7: mgr_set_node_attr (req_manager.c:656)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
==24663==    by 0x805BD76: req_manager (req_manager.c:1437)
==24663==    by 0x80559F6: dispatch_request (process_request.c:398)
==24663==    Address 0x4123811C is 12 bytes inside a block of size 68 free'd
==24663==    at 0x40026B6F: free (vg_replace_malloc.c:220)
==24663==    by 0x806FE06: node_status_list (attr_node_func.c:822)
==24663==    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
==24663== 
==24663== Invalid read of size 4
==24663==    at 0x805A5F6: mgr_set_node_attr (req_manager.c:657)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
==24663==    by 0x805BD76: req_manager (req_manager.c:1437)
==24663==    by 0x80559F6: dispatch_request (process_request.c:398)
==24663==    Address 0x4123811C is 12 bytes inside a block of size 68 free'd
==24663==    at 0x40026B6F: free (vg_replace_malloc.c:220)
==24663==    by 0x806FE06: node_status_list (attr_node_func.c:822)
==24663==    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
==24663== 
==24663== Invalid free() / delete / delete[]
==24663==    at 0x40026B6F: free (vg_replace_malloc.c:220)
==24663==    by 0x805A5FD: mgr_set_node_attr (req_manager.c:657)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
==24663==    by 0x805BD76: req_manager (req_manager.c:1437)
==24663==    Address 0x41237FB0 is 0 bytes inside a block of size 302 free'd
==24663==    at 0x40026B6F: free (vg_replace_malloc.c:220)
==24663==    by 0x806FDF5: node_status_list (attr_node_func.c:821)
==24663==    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
==24663== 
==24663== Invalid free() / delete / delete[]
==24663==    at 0x40026B6F: free (vg_replace_malloc.c:220)
==24663==    by 0x805A60E: mgr_set_node_attr (req_manager.c:658)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
==24663==    by 0x805BD76: req_manager (req_manager.c:1437)
==24663==    Address 0x41238110 is 0 bytes inside a block of size 68 free'd
==24663==    at 0x40026B6F: free (vg_replace_malloc.c:220)
==24663==    by 0x806FE06: node_status_list (attr_node_func.c:822)
==24663==    by 0x805A54E: mgr_set_node_attr (req_manager.c:634)
==24663==    by 0x805B3A6: mgr_node_set (req_manager.c:1075)
write_node_state: entered
update_nodes_file: entered
do_rpp: got an inter-server request
is_request: stream 0 version 1
is_request: connect from 172.20.3.88:1023
is_request: IS_STATUS
==24663== 
==24663== Syscall param write(buf) contains uninitialised or unaddressable byte(s)
==24663==    at 0x4030E754: __libc_write (in /lib/libc-2.2.5.so)
==24663==    by 0x804AA06: save_flush (attr_recov.c:220)
==24663==    by 0x806925C: svr_save (svr_recov.c:233)
==24663==    by 0x80550AE: main (pbsd_main.c:630)
==24663==    Address 0x8085CC2 is not stack'd, malloc'd or free'd
==24663== 
==24663== ERROR SUMMARY: 10 errors from 7 contexts (suppressed: 9 from 1)
==24663== malloc/free: in use at exit: 166696 bytes in 107 blocks.
==24663== malloc/free: 514 allocs, 409 frees, 823548 bytes allocated.
==24663== For a detailed leak analysis,  rerun with: --leak-check=yes
==24663== For counts of detected errors, rerun with: -v
+ killall pbs_sched
+ exit

--Boundary-00=_ypHK/fXzTGn6Op9
Content-Type: application/x-shellscript;
  name="server-script"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="server-script"

#!/bin/sh -x
export PATH=/usr/local/TORQUE-test/bin:/usr/local/TORQUE-test/sbin:$PATH
export PBSDEBUG=true

/usr/local/TORQUE-test/sbin/pbs_sched
~csamuel/bin/valgrind -v /usr/local/TORQUE-test/sbin/pbs_server
killall pbs_sched
exit

--Boundary-00=_ypHK/fXzTGn6Op9
Content-Type: application/x-shellscript;
  name="pbs-test"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="pbs-test"

#!/bin/sh -x
export PATH=/usr/local/TORQUE-test/bin:/usr/local/TORQUE-test/sbin:$PATH

pbsnodes -l
sleep 5
pbsnodes -o node088
sleep 5
pbsnodes -l
sleep 5
qterm -t quick
exit

--Boundary-00=_ypHK/fXzTGn6Op9--