[torquedev] memory issue in que_resolv() ?
Garrick Staples
garrick at clusterresources.com
Thu Sep 28 00:13:24 MDT 2006
On Thu, Sep 28, 2006 at 02:04:54AM -0400, Ian Stakenvicius alleged:
> [ Oops, i thought i had posted this earlier but i guess i just replied to
> myself. ]
>
> The reason I requested the dot-filename filter was because pbs_server was
> crashing on some job operations and seeming to not load my queues correctly.
> After browsing the code, i'm wondering about this:
>
> ===[ queue_recov.c, lines 265-... ]=====
> /* read in queue save sub-structure */
>
> if (read(fds, (char *)&pq->qu_qs, sizeof(struct queuefix)) !=
> sizeof(struct queuefix)) {
> log_err(errno, "que_recov", "read error");
> free((char *)pq);
> (void)close(fds);
> return ((pbs_queue *)0);
> }
> =====
>
> Note that the pq structure is freed w/o adjusting server.sv_qs.sv_numqueue
> or delete_link()'ing the pq.qu_link... Does this really cause no issues at
> this stage? Wouldn't it be better to be calling que_free() here (and above)
> instead?
>
> Reading through append_link(), it seems to me that free'ing an entry that
> hasn't been unlinked yet will break stuff, as it'll invalidate the head's
> ll_prior.. I haven't looked too far into this, so please correct me if i'm
> wrong.
>
> I did apply the following change though, and have not had any issues since:
>
> --- src/server/queue_recov.c 2006-07-27 18:53:53.000000000 -0400
> +++ src/server/queue_recov.c 2006-09-27 13:34:33.000000000 -0400
> @@ -257,7 +257,7 @@
> {
> log_err(errno,"que_recov","open error");
>
> - free((char *)pq);
> + que_free(pq); /* was free((char *)pq); */
>
> return(NULL);
> }
> @@ -267,7 +267,7 @@
> if (read(fds, (char *)&pq->qu_qs, sizeof(struct queuefix)) !=
> sizeof(struct queuefix)) {
> log_err(errno, "que_recov", "read error");
> - free((char *)pq);
> + que_free(pq); /* was free((char *)pq); */
> (void)close(fds);
> return ((pbs_queue *)0);
> }
I'd say you are correct. I've committed your patch to 2.1-fixes and
trunk.
I suspect this code was never really tested because a failure to
recovery a queue means you have bigger problems :)
More information about the torquedev
mailing list