[torquedev] [Bug 75] New: Double free's and touches of freed memory inside pbs_server

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Thu Aug 5 13:43:10 MDT 2010


           Summary: Double free's and touches of freed memory inside pbs_server
           Product: TORQUE
           Version: 2.5.x
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane at gmail.com
        ReportedBy: rea+maui at grid.kiae.ru
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

Created an attachment (id=47)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=47)
My notes on this bug.

It looks like I have dug up a case where pbs_server frees a chunk of
memory, then touches it, and then frees it again.  I hit this with
2.5.1, but it looks like most versions have this problem.

Here's what happens:
 - a modifyjob request comes in and process_request() allocates a
   new request with alloc_br();
 - dispatch_request() then calls req_modifyjob(), which in turn
   calls modify_job(), which in some cases (when job attributes
   are to be changed) calls relay_to_mom();
 - relay_to_mom() inserts this request (the one allocated with
   alloc_br()) into task_list_event by calling issue_Drequest();
 - modify_job() does its job and req_modifyjob() calls
   reply_ack(), which invokes reply_send();
 - reply_send() sends the reply and calls free_br() on our request,
   _but_ the same request was pushed to task_list_event, so
   once the MOM replies, pbs_server touches the freed memory
   chunk and frees it once again.
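
The sequence above can be modelled with a small stand-alone C sketch.
It is not TORQUE code: every name prefixed with fake_ is a hypothetical
stand-in, and the "free" only bumps a counter so the demo itself never
triggers undefined behaviour.  It just shows that one request ends up
with two owners, each of which releases it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a batch request; not TORQUE's real struct. */
struct fake_request {
    int release_count;          /* how many times "free" was called   */
    int touched_after_release;  /* accesses made after first release  */
};

/* Stand-in for free_br(): records the release instead of calling
 * free(), so the demo stays well-defined. */
static void fake_free_br(struct fake_request *req)
{
    req->release_count++;
}

/* Stand-in for any code that reads the request. */
static void fake_touch(struct fake_request *req)
{
    if (req->release_count > 0)
        req->touched_after_release++;
}

/* Stand-in for the task list entry created via issue_Drequest(). */
static struct fake_request *pending_task = NULL;

static void fake_relay_to_mom(struct fake_request *req)
{
    pending_task = req;          /* second owner: the pending task */
}

static void fake_reply_send(struct fake_request *req)
{
    fake_free_br(req);           /* first owner releases after replying */
}

static void fake_mom_replied(void)
{
    fake_touch(pending_task);    /* reads a request already released */
    fake_free_br(pending_task);  /* releases it a second time */
    pending_task = NULL;
}

/* Runs the whole sequence and returns the number of releases seen. */
int simulate_modifyjob(struct fake_request *req)
{
    req->release_count = 0;
    req->touched_after_release = 0;
    fake_relay_to_mom(req);
    fake_reply_send(req);
    fake_mom_replied();
    return req->release_count;
}
```

Calling simulate_modifyjob() on a request yields two releases and one
access after the first release, which is the pattern glibc aborts on
when the releases are real free() calls.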

Since one client request can modify multiple jobs (via
req_modifyarray()) and it is rather hard to make a proper deep copy of
a request (at least, it is hard for me), I ended up with a simple
refcounting patch.  It works in the sense that pbs_server stopped
dumping core (it did so because glibc detects double frees on CentOS 5.5
and calls abort()), but pbs_server 2.5.1 responded to requests like
'qstat -Bf' very slowly (with and without my patch), so I rolled
back to 2.4.9 on our production infrastructure.
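
For readers unfamiliar with the idea, here is a minimal sketch of
reference counting around an allocate/free pair.  It is not the
attached patch: struct rc_request, rc_alloc(), rc_retain() and
rc_release() are hypothetical names chosen for illustration, and the
plain int counter assumes a single thread touches the request.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical request type; the real payload fields are omitted. */
struct rc_request {
    int refcount;   /* number of owners still holding the request */
};

/* Plays the role of alloc_br(): the caller becomes the first owner. */
struct rc_request *rc_alloc(void)
{
    struct rc_request *req = calloc(1, sizeof(*req));
    if (req != NULL)
        req->refcount = 1;
    return req;
}

/* Called by every additional owner, e.g. before the request is
 * queued for a deferred reply. */
void rc_retain(struct rc_request *req)
{
    assert(req != NULL && req->refcount > 0);
    req->refcount++;
}

/* Plays the role of free_br(): drops one reference and frees the
 * memory only when the last owner lets go.
 * Returns 1 if the memory was freed, 0 if other owners remain. */
int rc_release(struct rc_request *req)
{
    assert(req != NULL && req->refcount > 0);
    if (--req->refcount > 0)
        return 0;
    free(req);
    return 1;
}
```

With this scheme the path that queues the request would call
rc_retain() first; the release after sending the reply then only
drops the count, and the memory goes away when the deferred handler
releases its own reference.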

Attached are two files:
 1. my patch that implements refcounting for alloc_br()/free_br();
 2. my notes on this bug -- developers can read them and accept/deny my

