[torquedev] memory issue in que_resolv() ?

Garrick Staples garrick at clusterresources.com
Thu Sep 28 00:13:24 MDT 2006


On Thu, Sep 28, 2006 at 02:04:54AM -0400, Ian Stakenvicius alleged:
> [ Oops, i thought i had posted this earlier but i guess i just replied to
> myself. ]
> 
> The reason I requested the dot-filename filter was  because pbs_server was
> crashing on some job operations and seeming to not load my queues correctly.
>  After browsing the code, i'm wondering about this:
> 
> ===[ queue_recov.c, lines 265-... ]=====
>         /* read in queue save sub-structure */
> 
>         if (read(fds, (char *)&pq->qu_qs, sizeof(struct queuefix)) !=
>             sizeof(struct queuefix)) {
>                 log_err(errno, "que_recov", "read error");
>                 free((char *)pq);
>                 (void)close(fds);
>                 return ((pbs_queue *)0);
>         }
> =====
> 
> Note that the pq structure is freed w/o adjusting server.sv_qs.sv_numqueue
> or delete_link()'ing the pq.qu_link...  Does this really cause no issues at
> this stage?  Wouldn't it be better to be calling que_free() here (and above)
> instead?
> 
> Reading through append_link(), it seems to me that free'ing an entry that
> hasn't been unlinked yet will break stuff, as it'll invalidate the head's
> ll_prior.. I haven't looked too far into this, so please correct me if i'm
> wrong.
> 
> I did apply the following change though, and have not had any issues since:
> 
> --- src/server/queue_recov.c	2006-07-27 18:53:53.000000000 -0400
> +++ src/server/queue_recov.c	2006-09-27 13:34:33.000000000 -0400
> @@ -257,7 +257,7 @@
>      {
>      log_err(errno,"que_recov","open error");
> 
> -    free((char *)pq);
> +    que_free(pq);  /* was free((char *)pq); */
> 
>      return(NULL);
>      }
> @@ -267,7 +267,7 @@
>  	if (read(fds, (char *)&pq->qu_qs, sizeof(struct queuefix)) !=
>  	    sizeof(struct queuefix)) {
>  		log_err(errno, "que_recov", "read error");
> -		free((char *)pq);
> +		que_free(pq);  /* was free((char *)pq); */
>  		(void)close(fds);
>  		return ((pbs_queue *)0);
>  	}

I'd say you are correct.  I've committed your patch to 2.1-fixes and
trunk.

I suspect this code was never really tested because a failure to
recovery a queue means you have bigger problems :)



More information about the torquedev mailing list