[torquedev] [Bug 75] New: Double free's and touches of freed memory inside pbs_server
bugzilla-daemon at supercluster.org
Thu Aug 5 13:43:10 MDT 2010
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=75
Summary: Double free's and touches of freed memory inside
pbs_server
Product: TORQUE
Version: 2.5.x
Platform: Other
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P5
Component: pbs_server
AssignedTo: glen.beane at gmail.com
ReportedBy: rea+maui at grid.kiae.ru
CC: torquedev at supercluster.org
Estimated Hours: 0.0
Created an attachment (id=47)
--> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=47)
My notes on this bug.
It looks like I have dug up a case where pbs_server frees a chunk of memory,
then touches it, and then frees it again. I hit this with 2.5.1, but it
looks like most versions have this problem.
Here is what happens:
- a modifyjob request comes in and process_request() allocates a
new request with alloc_br();
- dispatch_request() then calls req_modifyjob(), which in turn calls
modify_job(), which in some cases (when job attributes are to be
changed) calls relay_to_mom();
- relay_to_mom() inserts this request (the one allocated with alloc_br())
into task_list_event by calling issue_Drequest();
- modify_job() finishes its work and req_modifyjob() calls
reply_ack(), which invokes reply_send();
- reply_send() sends the reply and calls free_br() on our request,
_but_ the same request was already pushed onto task_list_event, so
once the MOM replies, pbs_server touches the freed memory
chunk and then frees it a second time.
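The sequence above can be replayed in a small C model. This is only a sketch: it does not use any TORQUE code, every mock_* name is hypothetical, and the "free" is merely recorded rather than performed, so the model itself has no undefined behaviour. It assumes, as described in the list, that the queued task and the reply path hold the same pointer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the request object; not the TORQUE struct. */
struct mock_request {
    bool released;        /* set once a holder has "freed" the request */
    int  release_calls;   /* how many times a free was attempted */
    int  stale_accesses;  /* accesses made after the first release */
};

/* Stand-in for free_br(): records the call instead of calling free(). */
static void mock_free_br(struct mock_request *req)
{
    req->release_calls++;
    req->released = true;
}

/* Stand-in for any read or write of the request's fields. */
static void mock_touch(struct mock_request *req)
{
    if (req->released)
        req->stale_accesses++;
}

/* Stand-in for reply_send(): use the request, then release it. */
static void mock_reply_send(struct mock_request *req)
{
    mock_touch(req);
    mock_free_br(req);
}

/* Stand-in for the queued task that runs when the MOM answers. */
static void mock_mom_reply_handler(struct mock_request *req)
{
    mock_touch(req);      /* the request was already released */
    mock_free_br(req);    /* and now it is released a second time */
}

/* Replays the reported sequence on a single request. */
static struct mock_request replay_modifyjob(void)
{
    struct mock_request req = { false, 0, 0 };
    struct mock_request *queued = &req;  /* the queue keeps the same pointer */

    mock_reply_send(&req);               /* reply path releases the request */
    mock_mom_reply_handler(queued);      /* later, the MOM's reply arrives */
    return req;
}
```

Calling replay_modifyjob() from a test harness yields a request with two release calls and one access after release, which corresponds to the double free and the touch of freed memory described above.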
Since a single client request can modify multiple jobs (via
req_modifyarray()) and it is rather hard to make a proper deep copy of a
request (at least, it is hard for me), I ended up with a simple
refcounting patch. It works in the sense that pbs_server stopped dumping
core (glibc detects double frees on CentOS 5.5 and calls abort()).
However, pbs_server 2.5.1 responded to requests like 'qstat -Bf' very
slowly, both with and without my patch, so I rolled back to 2.4.9 on our
production infrastructure.
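For illustration, here is a minimal sketch of the general refcounting idea in C. It is not the attached patch: the rc_* names and the struct are hypothetical, the real request structure has many more fields, and the sketch assumes the counter is only ever touched from one thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical request type used only for this sketch. */
struct rc_request {
    int refcount;   /* number of holders that still need the request */
    int type;       /* placeholder payload */
};

/* Counts real free() calls so the behaviour can be checked. */
static int rc_frees = 0;

/* Analogue of alloc_br(): the caller receives the first reference. */
static struct rc_request *rc_alloc(int type)
{
    struct rc_request *req = calloc(1, sizeof *req);
    if (req == NULL)
        return NULL;
    req->refcount = 1;
    req->type = type;
    return req;
}

/* Taken by any code that stores the pointer for later use,
 * for example before queueing the request until the MOM replies. */
static void rc_hold(struct rc_request *req)
{
    req->refcount++;
}

/* Analogue of free_br(): memory is freed only when the last holder
 * lets go. Returns 1 if the memory was released, 0 otherwise. */
static int rc_release(struct rc_request *req)
{
    if (req == NULL || --req->refcount > 0)
        return 0;
    free(req);
    rc_frees++;
    return 1;
}

/* Replays the modifyjob sequence with counted references.
 * Returns 10 * (freed by reply path) + (freed by MOM handler), or -1. */
static int rc_replay_modifyjob(void)
{
    struct rc_request *req = rc_alloc(0);
    int freed_by_reply, freed_by_mom;

    if (req == NULL)
        return -1;
    rc_hold(req);                      /* the queued task keeps a reference */
    freed_by_reply = rc_release(req);  /* reply path: count 2 -> 1, no free */
    freed_by_mom   = rc_release(req);  /* MOM handler: count 1 -> 0, freed */
    return freed_by_reply * 10 + freed_by_mom;
}
```

With this scheme the reply path no longer frees memory that the queued task still needs, and the request is freed exactly once, by whichever holder releases it last. A request that fans out to several jobs would take one extra reference per queued task.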
Attached are two files:
1. my patch that implements refcounting for alloc_br()/free_br();
2. my notes on this bug; developers can read them and accept or reject
my findings.
--
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.