Bug 98 - Allocation of incorrect pointer in src/scheduler.cc/samples/fifo/job_info.c: update_job_comment causes random crash.
: Allocation of incorrect pointer in src/scheduler.cc/samples/fifo/job_info.c: ...
Status: RESOLVED FIXED
Product: TORQUE
pbs_sched
: 2.4.x
: PC Linux
: P5 critical
Assigned To: Glen
:
:
:
  Show dependency treegraph
 
Reported: 2010-11-09 02:58 MST by Stephen Usher
Modified: 2010-11-22 19:00 MST (History)
5 users (show)

See Also:


Attachments
job_info.c: Adds local copy of comment passed to pbs_alterjob() to stop random SEGV (in sprintf()) and to safeguard data. (762 bytes, patch)
2010-11-11 03:06 MST, Stephen Usher
Details | Diff
job_info.c: Adds local copy of comment in a static buffer to be passed tp pbs_alterjob. (1.03 KB, patch)
2010-11-11 03:24 MST, Stephen Usher
Details | Diff
src/scheduler.cc/sample/fifo.c: sprintf() used unsafely with error messages causing buffer overrun. Fix using snprintf(). (1.54 KB, patch)
2010-11-11 06:09 MST, Stephen Usher
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description Stephen Usher 2010-11-09 02:58:38 MST
pbs_sched will crash randomly when there are large numbers of jobs in a queue
due to the assignment of the wrong pointer at:

src/scheduler.cc/samples/fifo/job_info.c:695: update_job_comment

The pointer passed to the routine as the message text is assigned to attr.value
instead of the duplicated copy. Depending upon the code execution path this may
point to a stack-based buffer which could disappear later in the program's
processing. Not only this, but the value held will change
non-deterministically.

--- torque-2.4.11/src/scheduler.cc/samples/fifo/job_info.c.orig 2009-10-29
21:01:21.000000000 +0000
+++ torque-2.4.11/src/scheduler.cc/samples/fifo/job_info.c      2010-11-08
15:37:57.848143299 +0000
@@ -692,7 +692,7 @@

     jinfo -> comment = string_dup(comment);

-    attr.value = comment;
+    attr.value = jinfo -> comment;

     pbs_alterjob(pbs_sd, jinfo -> name, &attr, NULL);
Comment 1 Chris Samuel 2010-11-09 16:45:13 MST
Hiya Steve! :-)

Good catch, the patch will apply to the 2.4-fixes, 2.5-fixes and 3.0 branches
as well as trunk.

Can someone with commit privs apply it please ?

thanks!
Chris
Comment 2 David Singleton 2010-11-09 16:54:22 MST
I guess I dont quite understand. Isn't pbs_alterjob() synchronous? Why does it
matter that comment may be stack-based? The server copy wont be.

Cheers,
David

(In reply to comment #0)
> pbs_sched will crash randomly when there are large numbers of jobs in a queue
> due to the assignment of the wrong pointer at:
> 
> src/scheduler.cc/samples/fifo/job_info.c:695: update_job_comment
> 
> The pointer passed to the routine as the message text is assigned to attr.value
> instead of the duplicated copy. Depending upon the code execution path this may
> point to a stack-based buffer which could disappear later in the program's
> processing. Not only this, but the value held will change
> non-deterministically.
> 
> --- torque-2.4.11/src/scheduler.cc/samples/fifo/job_info.c.orig 2009-10-29
> 21:01:21.000000000 +0000
> +++ torque-2.4.11/src/scheduler.cc/samples/fifo/job_info.c      2010-11-08
> 15:37:57.848143299 +0000
> @@ -692,7 +692,7 @@
> 
>      jinfo -> comment = string_dup(comment);
> 
> -    attr.value = comment;
> +    attr.value = jinfo -> comment;
> 
>      pbs_alterjob(pbs_sd, jinfo -> name, &attr, NULL);
Comment 3 Simon Toth 2010-11-10 01:12:57 MST
(In reply to comment #2)
> I guess I dont quite understand. Isn't pbs_alterjob() synchronous? Why does it
> matter that comment may be stack-based? The server copy wont be.

Yes it is synchronous. This patch doesn't change anything.
Comment 4 Simon Toth 2010-11-10 01:35:14 MST
(In reply to comment #3)
> (In reply to comment #2)
> > I guess I dont quite understand. Isn't pbs_alterjob() synchronous? Why does it
> > matter that comment may be stack-based? The server copy wont be.
> 
> Yes it is synchronous. This patch doesn't change anything.

Btw, if indeed a heap pointer is required (and presumably freed) the correct
fix would be attr.value = strdup(comment), because otherwise the jinfo->comment
pointer would be destroyed.
Comment 5 Stephen Usher 2010-11-10 02:23:15 MST
The empirical evidence is that pointer to writable memory is required by
pbs_alterjob() as without my patch pbs_sched would run for a few minutes before
dying with a SEGV in sprintf(). (This is with around 2500 jobs in the queue by
the way.) With this patch it's rock solid.

Simon, agreed, strdup() is probably the way to go.
Comment 6 Stephen Usher 2010-11-10 02:27:06 MST
Actually, thinking about it.. when would the string created by strdup() ever be
freed? The answer is, it wouldn't. There will be a memory leak.

So, after the call to pbs_alterjob() should the pointer be free()ed?

If a structure within the pbs job info storage is merely using a reference to
the passed memory then this could cause further problems later, if not then
it's not a problem.
Comment 7 Simon Toth 2010-11-10 02:29:28 MST
(In reply to comment #5)
> The empirical evidence is that pointer to writable memory is required by
> pbs_alterjob() as without my patch pbs_sched would run for a few minutes before
> dying with a SEGV in sprintf(). (This is with around 2500 jobs in the queue by
> the way.) With this patch it's rock solid.
> 
> Simon, agreed, strdup() is probably the way to go.

This is weird. Our scheduler is based of FIFO (and the specified code is
actually identical). Still it doesn't crash (and we did extensive stress
testing - hundreds of tests, each with 5000 jobs).

Could you post the stack trace of one of the crashes?
Comment 8 Simon Toth 2010-11-10 02:32:03 MST
(In reply to comment #6)
> Actually, thinking about it.. when would the string created by strdup() ever be
> freed? The answer is, it wouldn't. There will be a memory leak.
> 
> So, after the call to pbs_alterjob() should the pointer be free()ed?
> 
> If a structure within the pbs job info storage is merely using a reference to
> the passed memory then this could cause further problems later, if not then
> it's not a problem.

There are only two options. pbs_alterjob requires a heap allocated pointer that
is freed inside pbs_alterjob, or it takes a string constant. I think that it
takes a string constant.
Comment 9 David Singleton 2010-11-10 02:33:50 MST
(In reply to comment #5)
> The empirical evidence is that pointer to writable memory is required by
> pbs_alterjob() as without my patch pbs_sched would run for a few minutes before
> dying with a SEGV in sprintf(). (This is with around 2500 jobs in the queue by
> the way.) With this patch it's rock solid.
> 

I suspect that just means there is a memory bug somewhere else. Have you tried
running it (the original) under valgrind?

David
Comment 10 Stephen Usher 2010-11-10 02:34:53 MST
I will take a look to see if I still have one. The problem is that I can't
afford to give any more time to this problem as this Torque installation is for
one research group amongst many and the three days I spent tying this down was
more than I could really afford.
Comment 11 Simon Toth 2010-11-10 02:43:29 MST
(In reply to comment #10)
> I will take a look to see if I still have one. The problem is that I can't
> afford to give any more time to this problem as this Torque installation is for
> one research group amongst many and the three days I spent tying this down was
> more than I could really afford.

Just turn on core file generation "ulimit -c unlimited" and then run "gdb
pbs_sched core" with the generated core from the crash and post the stacktrace
here.

Btw. if it consistently crashes always after a certain amount of time, then
check the alarm timer in pbs_sched.c. If a scheduling loop runs for longer time
then specified, the scheduler is killed.
Comment 12 Stephen Usher 2010-11-10 02:52:56 MST
Unfortunately it's not deterministic. I also don't have a core currently
either.

I'm attempting a debug run now with the old code.
Comment 13 Stephen Usher 2010-11-10 04:58:03 MST
Of course, now that I've got the executable ready for debugging it's not
crashing!
Comment 14 Ken Nielson 2010-11-10 10:29:59 MST
(In reply to comment #13)
> Of course, now that I've got the executable ready for debugging it's not
> crashing!

When you have a patch you are confident will fix the problem we will get it
committed to the tree.
Comment 15 Stephen Usher 2010-11-11 03:06:37 MST
Created an attachment (id=60) [details]
job_info.c: Adds local copy of comment passed to pbs_alterjob() to stop random
SEGV (in sprintf()) and to safeguard data.

A little bit of a belt and braces approach (defensive programming), but better
to be safe than sorry if we don't know what pbs_alterjob() is doing to the data
we pass to it.

Because there's a possibility that pbs_alterjob() may change the value of the
pointer passed to it I'm assuming that it will and, for the cost of the size of
a pointer, keeping a local copy so that the memory free()'d at the end of the
function is guaranteed to be that which was allocated earlier.
Comment 16 Stephen Usher 2010-11-11 03:24:44 MST
Created an attachment (id=61) [details]
job_info.c: Adds local copy of comment in a static buffer to be passed tp
pbs_alterjob.

Sorry, previous patch causes more problems than it solves.

This is a new patch using a static buffer.
Comment 17 Stephen Usher 2010-11-11 04:36:10 MST
Hmm.. After installing that patch I now get a totally different crash:

(gdb) where
#0  0xffffe430 in __kernel_vsyscall ()
#1  0xf75aece1 in raise () from /lib/libc.so.6
#2  0xf75b0632 in abort () from /lib/libc.so.6
#3  0x08049d63 in catch_abort (sig=11) at pbs_sched.c:225
#4  <signal handler called>
#5  0x0804d8d2 in update_job_comment (pbs_sd=775369784, jinfo=0x3832312c, 
    comment=0xfffe4e78 "Not Running - PBS Error: Resource temporarily
unavailable REJHOST=everest MSG=cannot allocate node 'everest' to job - node
not currently available (nps needed/free: 1/-1,  joblist:
128863.newton:0,128"...)
    at job_info.c:695
#6  0x0804c7c0 in run_update_job (pbs_sd=775369784, sinfo=0x7477656e, 
    qinfo=0x303a6e6f, jinfo=0x3832312c) at fifo.c:702
#7  0x3832312c in ?? ()
#8  0x2e373438 in ?? ()
#9  0x7477656e in ?? ()
#10 0x303a6e6f in ?? ()
#11 0x3832312c in ?? ()
#12 0x2e323338 in ?? ()
#13 0x7477656e in ?? ()
#14 0x303a6e6f in ?? ()
#15 0xfffe0029 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Notice that it's not even getting as far as copying the comment!
Comment 18 Stephen Usher 2010-11-11 06:09:14 MST
Created an attachment (id=62) [details]
src/scheduler.cc/sample/fifo.c: sprintf() used unsafely with error messages
causing buffer overrun. Fix using snprintf().

Occasionally an error message is passed to run_update_job() which is far longer
than the default buffer size: 256 bytes.

Because there is no sanity checking of this string length it is blindly written
into the stack-based buffer, buf, causing stack corruption.

This patch parameterizes the the size of the buffer (using a pre-processor
definition), increases it to 1024 bytes and hardens the copy of the message
into the buffer using snprintf() to limit the length of the string being
copied.
Comment 19 Simon Toth 2010-11-11 06:45:57 MST
(In reply to comment #18)
> Created an attachment (id=62) [details] [details]
> src/scheduler.cc/sample/fifo.c: sprintf() used unsafely with error messages
> causing buffer overrun. Fix using snprintf().
> 
> Occasionally an error message is passed to run_update_job() which is far longer
> than the default buffer size: 256 bytes.
> 
> Because there is no sanity checking of this string length it is blindly written
> into the stack-based buffer, buf, causing stack corruption.
> 
> This patch parameterizes the the size of the buffer (using a pre-processor
> definition), increases it to 1024 bytes and hardens the copy of the message
> into the buffer using snprintf() to limit the length of the string being
> copied.

This patch looks good.
Comment 20 Glen 2010-11-18 20:38:19 MST
this final patch looks good to me
I can try to get it committed when I'm back from Supercomputing
Comment 21 Glen 2010-11-22 11:48:27 MST
I'm planning on committing this patch to trunk, 2.5-fixes, and 2.4-fixes -
hopefully within the next 24hours
Comment 22 Glen 2010-11-22 19:00:00 MST
patch to guard against buffer overruns pbs_sched applied in trunk, 2.4-fixes,
and 2.5-fixes. Stephen Usher credited for patch in CHANGELOG.