Bug 75 - Double free's and touches of freed memory inside pbs_server
: Double free's and touches of freed memory inside pbs_server
Status: RESOLVED FIXED
Product: TORQUE
pbs_server
: 2.5.x
: Other Linux
: P5 major
Assigned To: Glen
:
:
:
  Show dependency treegraph
 
Reported: 2010-08-05 13:43 MDT by Eygene Ryabinkin
Modified: 2010-08-14 16:05 MDT (History)
3 users (show)

See Also:


Attachments
My notes on this bug. (12.67 KB, text/plain)
2010-08-05 13:43 MDT, Eygene Ryabinkin
Details
My patch the implements refcounting for alloc_br()/free_br(). (2.72 KB, patch)
2010-08-05 13:44 MDT, Eygene Ryabinkin
Details | Diff
Add error code PBSE_RELAYED_TO_MOM to modfiy_job in src/server/req_modify.c (1.18 KB, patch)
2010-08-09 11:36 MDT, Ken Nielson
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description Eygene Ryabinkin 2010-08-05 13:43:09 MDT
Created an attachment (id=47) [details]
My notes on this bug.

It looks like I digged the case where pbs_server will free the memory,
then touch it and then will free it again.  I had experienced it with
2.5.1, but it looks like most versions should have this problem.

Here's what happens:
 - modifyjob request comes in, process_request() will allocate
   new request with alloc_br();
 - then dispatch_request() will call req_modifyjob() that in turn
   will call modify_job() and which in some cases (when job attributes
   are to be changed) will call relay_to_mom();
 - relay_to_mom() will insert this request (allocated with alloc_br())
   into task_list_event (by calling issue_Drequest());
 - modify_job() will do its job and req_modifyjob() will call
   reply_ack() that will invoke reply_send();
 - reply_send() sends the reply and calls free_br() on our request;
   _but_ the same request was pushed to the task_list_event, so
   once the MOM will reply, pbs_server will touch the freed memory
   chunk and will free it once again.

Since there can be modifications of multiple jobs per one client's
request (via req_modifyarray()) and it is rather hard to make a proper
deep copy of a request (at least, it is hard for me), I ended up with a
simple refcounting patch.  It works in the sense that pbs_server stopped
to dump core (because glibc detects double frees on CentOS 5.5 and calls
abort()), but pbs_server for 2.5.1 was responding to the requests like
'qstat -Bf' very slowly (with and without my patch), so I had rolled
back to 2.4.9 at our production infrastructure.

Attached are two files:
 1. my patch that implements refcounting for alloc_br()/free_br();
 2. my notes on this bug -- developers can read them and accept/deny my
findings.
Comment 1 Eygene Ryabinkin 2010-08-05 13:44:11 MDT
Created an attachment (id=48) [details]
My patch the implements refcounting for alloc_br()/free_br().
Comment 2 Ken Nielson 2010-08-09 11:36:23 MDT
Created an attachment (id=49) [details]
Add error code PBSE_RELAYED_TO_MOM to modfiy_job in src/server/req_modify.c

The bug which caused this double free error was introduced when the function
modify_job was added to req_modifyjob. Previously req_modifyjob called
relay_to_mom directly and returned without freeing the batch_request structure.
This patch allows modify_job to return an error code to req_modifyjob letting
it know a call was made to relay_to_mom and req_modify job does not need to
call reply_ack (where the batch_request buffer will be freed)
Comment 3 Eygene Ryabinkin 2010-08-10 00:07:47 MDT
It will work for a single relayed request.  But what about req_modifyarray()? 
It will use the single struct batch_request for calling modify_job() via
modify_whole_array()/modify_array_range(), so again, single batch_request will
end up freed up a number of times.
Comment 4 Ken Nielson 2010-08-10 10:07:24 MDT
(In reply to comment #3)
> It will work for a single relayed request.  But what about req_modifyarray()? 
> It will use the single struct batch_request for calling modify_job() via
> modify_whole_array()/modify_array_range(), so again, single batch_request will
> end up freed up a number of times.

There is only one place in req_deletearray where the batch_request buffer is
passed as a parameter to a task. The task is set to run array_delete_wt.
However, req_deletearray does not do a reply_ack for this case and returns as
it is suppose to. 

There is a problem in that the code does not check for a successful return from
set_task. I will fix that and check it in.
Comment 5 Eygene Ryabinkin 2010-08-10 10:21:15 MDT
(In reply to comment #4)
> There is only one place in req_deletearray where the batch_request buffer is
> passed as a parameter to a task. The task is set to run array_delete_wt.
> However, req_deletearray does not do a reply_ack for this case and returns as
> it is suppose to. 

I am talking about req_modifyarray() that will call modify_job() multiple
times, so a single batch_request will be subscribed multiple times into the
task_list_event with post_modify_req() as the event handler.  So again,
batch_request will be freed multiple times.  Am I missing something?
Comment 6 Ken Nielson 2010-08-14 16:05:25 MDT
> I am talking about req_modifyarray() that will call modify_job() multiple
> times, so a single batch_request will be subscribed multiple times into the
> task_list_event with post_modify_req() as the event handler.  So again,
> batch_request will be freed multiple times.  Am I missing something?

I checked in a fix for the req_modifyarray. It required considerable more work.
I had to do a deep copy of the batch request structure for each array element
that had a running job and needed to contact the mom. I also needed to create
an new callback function post_modify_arrayreq to handle the array
considerations.