[torquedev] Double free and touches of freed memory inside pbs_server
rea+maui at grid.kiae.ru
Thu Aug 5 11:26:40 MDT 2010
It looks like I dug up a case where pbs_server frees a chunk of memory,
then touches it, and then frees it again. I experienced this with
2.5.1, but it looks like most versions have this problem.
Here's what happens:
- modifyjob request comes in, process_request() will allocate
new request with alloc_br();
- then dispatch_request() will call req_modifyjob(), which in turn
will call modify_job(), which in some cases (when job attributes
are to be changed) will call relay_to_mom();
- relay_to_mom() will insert this request (allocated with alloc_br())
into task_list_event (by calling issue_Drequest());
- modify_job() will do its job and req_modifyjob() will call
reply_ack() that will invoke reply_send();
- reply_send() sends the reply and calls free_br() on our request;
_but_ the same request was pushed to the task_list_event, so
once the MOM replies, pbs_server will touch the freed memory
chunk and free it once again.
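To make the sequence above concrete, here is a minimal model of it. It
is only a sketch: the struct and all model_* functions are hypothetical
stand-ins, not the real pbs_server code, and the "free" is replaced by a
counter so that the double release can be observed without actually
corrupting the heap.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the batch request; not the real structure. */
struct model_request {
    int release_count;              /* how many times an owner released it */
};

/* Hypothetical stand-in for the list that relay_to_mom() pushes onto. */
struct model_task_list {
    struct model_request *pending;
};

/* Models free_br(): counts the release instead of really freeing. */
static void model_free(struct model_request *r)
{
    r->release_count++;
}

/* Models relay_to_mom(): the task list keeps a second pointer. */
static void model_relay(struct model_task_list *tl, struct model_request *r)
{
    tl->pending = r;
}

/* Models reply_send(): the reply goes out, then the request is released. */
static void model_reply_send(struct model_request *r)
{
    model_free(r);
}

/* Models the handler that runs once the MOM replies. */
static void model_mom_reply(struct model_task_list *tl)
{
    struct model_request *r = tl->pending;  /* already released by now */
    tl->pending = NULL;
    model_free(r);                          /* second release */
}

/* Runs the whole sequence; returns the number of release attempts. */
static int simulate_modifyjob(void)
{
    struct model_request *r = calloc(1, sizeof *r);
    struct model_task_list tl = { NULL };
    int releases;

    if (r == NULL)
        return -1;
    model_relay(&tl, r);
    assert(tl.pending == r);    /* two owners now point at one request */
    model_reply_send(r);        /* first owner releases */
    model_mom_reply(&tl);       /* second owner releases again */
    releases = r->release_count;
    free(r);                    /* the only real free in this model */
    return releases;
}
```

In this model simulate_modifyjob() reports two release attempts for a
single allocation, which is the double free described above.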
Since a single client request can modify multiple jobs (via
req_modifyarray()) and it is rather hard to make a proper deep copy
of a request (at least, it is hard for me), I ended up with a simple
refcounting patch. It works in the sense that pbs_server stopped
dumping core (it did so because glibc detects double frees on CentOS
5.5 and calls abort()), but pbs_server 2.5.1 responded to requests
like 'qstat -Bf' very slowly (with and without my patch), so I rolled
back to 2.4.9 on our production infrastructure.
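For readers who cannot open the attachment, the general idea of
refcounting can be sketched as follows. This is not the attached patch:
the rc_request type, the function names and the live_requests counter
are all made up for illustration, and it assumes that every holder of
the request (the reply path and the pending task list) takes its own
reference and drops it exactly once.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted request; not the real batch request type. */
struct rc_request {
    int refcount;
};

/* Allocated-but-not-freed requests; exists only to check the sketch. */
static int live_requests = 0;

/* Allocates a request owned by the caller (refcount starts at 1). */
static struct rc_request *rc_request_alloc(void)
{
    struct rc_request *r = calloc(1, sizeof *r);

    if (r != NULL) {
        r->refcount = 1;
        live_requests++;
    }
    return r;
}

/* Called by each additional holder, e.g. before queuing the request. */
static void rc_request_ref(struct rc_request *r)
{
    r->refcount++;
}

/* Drops one reference; frees the memory only when the last one goes.
 * Returns 1 if the request was really freed, 0 otherwise. */
static int rc_request_unref(struct rc_request *r)
{
    assert(r->refcount > 0);
    if (--r->refcount > 0)
        return 0;
    free(r);
    live_requests--;
    return 1;
}
```

With this scheme the reply path can drop its reference right after
sending the reply, and the memory stays valid until the handler for the
MOM's reply drops the last one.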
The patch is attached, and it would be great if someone could
evaluate both the patch and the logic above.
Meanwhile, I will try to backport the patch to 2.4.9 and use it
on our production systems.
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 2790 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20100805/53b16f50/attachment-0001.bin