Bugzilla – Bug 75
Double free's and touches of freed memory inside pbs_server
Last modified: 2010-08-14 16:05:25 MDT
You need to log in before you can comment on or make changes to this bug.
Created an attachment (id=47) [details] My notes on this bug. It looks like I digged the case where pbs_server will free the memory, then touch it and then will free it again. I had experienced it with 2.5.1, but it looks like most versions should have this problem. Here's what happens: - modifyjob request comes in, process_request() will allocate new request with alloc_br(); - then dispatch_request() will call req_modifyjob() that in turn will call modify_job() and which in some cases (when job attributes are to be changed) will call relay_to_mom(); - relay_to_mom() will insert this request (allocated with alloc_br()) into task_list_event (by calling issue_Drequest()); - modify_job() will do its job and req_modifyjob() will call reply_ack() that will invoke reply_send(); - reply_send() sends the reply and calls free_br() on our request; _but_ the same request was pushed to the task_list_event, so once the MOM will reply, pbs_server will touch the freed memory chunk and will free it once again. Since there can be modifications of multiple jobs per one client's request (via req_modifyarray()) and it is rather hard to make a proper deep copy of a request (at least, it is hard for me), I ended up with a simple refcounting patch. It works in the sense that pbs_server stopped to dump core (because glibc detects double frees on CentOS 5.5 and calls abort()), but pbs_server for 2.5.1 was responding to the requests like 'qstat -Bf' very slowly (with and without my patch), so I had rolled back to 2.4.9 at our production infrastructure. Attached are two files: 1. my patch that implements refcounting for alloc_br()/free_br(); 2. my notes on this bug -- developers can read them and accept/deny my findings.
Created an attachment (id=48) [details] My patch the implements refcounting for alloc_br()/free_br().
Created an attachment (id=49) [details] Add error code PBSE_RELAYED_TO_MOM to modfiy_job in src/server/req_modify.c The bug which caused this double free error was introduced when the function modify_job was added to req_modifyjob. Previously req_modifyjob called relay_to_mom directly and returned without freeing the batch_request structure. This patch allows modify_job to return an error code to req_modifyjob letting it know a call was made to relay_to_mom and req_modify job does not need to call reply_ack (where the batch_request buffer will be freed)
It will work for a single relayed request. But what about req_modifyarray()? It will use the single struct batch_request for calling modify_job() via modify_whole_array()/modify_array_range(), so again, single batch_request will end up freed up a number of times.
(In reply to comment #3) > It will work for a single relayed request. But what about req_modifyarray()? > It will use the single struct batch_request for calling modify_job() via > modify_whole_array()/modify_array_range(), so again, single batch_request will > end up freed up a number of times. There is only one place in req_deletearray where the batch_request buffer is passed as a parameter to a task. The task is set to run array_delete_wt. However, req_deletearray does not do a reply_ack for this case and returns as it is suppose to. There is a problem in that the code does not check for a successful return from set_task. I will fix that and check it in.
(In reply to comment #4) > There is only one place in req_deletearray where the batch_request buffer is > passed as a parameter to a task. The task is set to run array_delete_wt. > However, req_deletearray does not do a reply_ack for this case and returns as > it is suppose to. I am talking about req_modifyarray() that will call modify_job() multiple times, so a single batch_request will be subscribed multiple times into the task_list_event with post_modify_req() as the event handler. So again, batch_request will be freed multiple times. Am I missing something?
> I am talking about req_modifyarray() that will call modify_job() multiple > times, so a single batch_request will be subscribed multiple times into the > task_list_event with post_modify_req() as the event handler. So again, > batch_request will be freed multiple times. Am I missing something? I checked in a fix for the req_modifyarray. It required considerable more work. I had to do a deep copy of the batch request structure for each array element that had a running job and needed to contact the mom. I also needed to create an new callback function post_modify_arrayreq to handle the array considerations.