[torquedev] Double free and touches of freed memory inside pbs_server

Glen Beane glen.beane at gmail.com
Thu Aug 5 13:01:43 MDT 2010


On Thu, Aug 5, 2010 at 1:26 PM, Eygene Ryabinkin <rea+maui at grid.kiae.ru> wrote:
> Good day.
>
> It looks like I have dug up a case where pbs_server frees a chunk of
> memory, then touches it, and then frees it again.  I hit this with
> 2.5.1, but it looks like most versions have the same problem.
>
> Here's what happens:
>  - a modifyjob request comes in and process_request() allocates
>   a new request with alloc_br();
>  - dispatch_request() then calls req_modifyjob(), which in turn
>   calls modify_job(), which in some cases (when job attributes
>   are to be changed) calls relay_to_mom();
>  - relay_to_mom() inserts this request (allocated with alloc_br())
>   into task_list_event (by calling issue_Drequest());
>  - modify_job() finishes its work and req_modifyjob() calls
>   reply_ack(), which invokes reply_send();
>  - reply_send() sends the reply and calls free_br() on our request,
>   _but_ the same request was pushed to task_list_event, so
>   once the MOM replies, pbs_server touches the freed memory
>   chunk and frees it once again.
>
> Since a single client request can modify multiple jobs (via
> req_modifyarray()) and it is rather hard to make a proper
> deep copy of a request (at least, it is hard for me), I ended up with a
> simple refcounting patch.  It works in the sense that pbs_server stopped
> dumping core (glibc detects double frees on CentOS 5.5 and calls
> abort()), but pbs_server 2.5.1 responded to requests like
> 'qstat -Bf' very slowly (with and without my patch), so I rolled
> back to 2.4.9 on our production infrastructure.
>
> The patch is attached, and it would be very good if someone could
> evaluate both the patch and the logic above.
>
> Meanwhile, I will try to backport the patch to 2.4.9 and use it
> on our production systems.
>
> Thanks!
> --
> Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"

Hi Eygene, thank you for the patch. Could you please open a Bugzilla
bug for this issue and attach your patch? That way, if a developer
isn't able to look at this right away, it won't get lost in their
inbox.

www.clusterresources.com/bugzilla
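
For readers following the thread, the refcounting idea described in the
quoted message could be sketched roughly as below. This is not the
attached patch and not TORQUE code: struct request, request_alloc(),
request_hold(), request_release() and frees_performed are hypothetical
names used only for illustration. The assumption is that every code
path that keeps a pointer to the request (the reply path and the queued
task waiting for the MOM) holds its own reference, and the memory is
released only when the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the server's batch request structure. */
struct request {
    int refcount;   /* number of code paths still holding the request */
    int type;       /* placeholder payload */
};

/* Counts real deallocations so the behaviour can be checked. */
static int frees_performed = 0;

/* Allocate a request owned by exactly one holder (the caller). */
static struct request *request_alloc(int type)
{
    struct request *req = calloc(1, sizeof(*req));
    if (req == NULL)
        return NULL;
    req->refcount = 1;
    req->type = type;
    return req;
}

/* Take an extra reference, e.g. before queuing the request
 * for a reply that will arrive later. */
static struct request *request_hold(struct request *req)
{
    assert(req != NULL && req->refcount > 0);
    req->refcount++;
    return req;
}

/* Drop one reference; free the memory only when the last holder
 * lets go. Returns 1 if the request was actually freed, else 0. */
static int request_release(struct request *req)
{
    assert(req != NULL && req->refcount > 0);
    if (--req->refcount > 0)
        return 0;
    free(req);
    frees_performed++;
    return 1;
}
```

In terms of the sequence in the quoted message, the relay path would
call request_hold() before queuing the request, the reply path would
call request_release() where it currently frees the request directly,
and the handler for the MOM's reply would call request_release() once
more. The first release leaves the memory valid for the queued task and
the second one frees it, so the request is freed exactly once.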

