[torquedev] pbs_mom crashing

Glen Beane glen.beane at gmail.com
Tue Jul 21 20:25:47 MDT 2009


The code in job_free() referenced by that stack trace looks like this:

  /* remove any malloc working attribute space */

  for (i = 0;i < (int)JOB_ATR_LAST;i++)
    {
    job_attr_def[i].at_free(&pj->ji_wattr[i]);  /* this is line 509!! */
    }


and the free_str() that is being called (via job_attr_def[i].at_free())
looks like this:

void free_str(

  struct attribute *attr)

  {
  if ((attr->at_flags & ATR_VFLAG_SET) && (attr->at_val.at_str != NULL))
    {
    free(attr->at_val.at_str);
    }

  attr->at_val.at_str = NULL;

  attr->at_flags &= ~ATR_VFLAG_SET;

  return;
  }  /* END free_str() */


So, as we can see, free() should not be called unless the attribute
string has been allocated and has not yet been freed. I've come up with
a few things that could make the call to free() in free_str() fail. I
haven't checked the core closely enough to tell whether any of them is
actually consistent with the observed failure:

1) somewhere pbs_mom had already freed the string storing the
attribute value, and did it incorrectly (by calling free() directly
without unsetting ATR_VFLAG_SET or setting the pointer to NULL), so
we attempt to free() the string a second time

2) this is a new attribute definition and its free function is
incorrectly set to free_str when it is not really a string (unlikely,
unless this attribute is rarely set)

3) the attribute at_val.at_str is getting corrupted somehow so we call
free() on an invalid pointer

And I'm sure I'm missing other plausible failures.
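
To make case 1 concrete, here is a minimal sketch of how the guard in
free_str() gets bypassed. Everything in it is a stand-in: demo_attr,
DEMO_VFLAG_SET, demo_release() and demo_buggy_free() are hypothetical
names, not TORQUE code, and demo_release() only counts calls instead of
calling free() so the double release can be observed without crashing.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_VFLAG_SET 0x01  /* stand-in for ATR_VFLAG_SET; value is assumed */

/* Simplified stand-in for struct attribute; not the real TORQUE layout. */
struct demo_attr {
  unsigned int flags;
  char *str;
};

static int release_count = 0;

/* Stand-in for free(): counts calls instead of releasing memory. */
static void demo_release(char *p) {
  (void)p;
  release_count++;
}

/* Mirrors free_str(): release only if set and non-NULL, then reset both. */
static void demo_free_str(struct demo_attr *a) {
  if ((a->flags & DEMO_VFLAG_SET) && (a->str != NULL))
    demo_release(a->str);
  a->str = NULL;
  a->flags &= ~DEMO_VFLAG_SET;
}

/* Hypothetical buggy caller: releases the string but leaves the flag
 * set and the pointer non-NULL. */
static void demo_buggy_free(struct demo_attr *a) {
  demo_release(a->str);
}

/* Returns how many times the same string is released during cleanup,
 * optionally after a direct (buggy) release. */
static int demo_releases(int buggy_first) {
  static char storage[] = "value";
  struct demo_attr a = { DEMO_VFLAG_SET, storage };

  release_count = 0;
  if (buggy_first)
    demo_buggy_free(&a);
  demo_free_str(&a);
  demo_free_str(&a);  /* no-op: flag cleared and pointer NULL by now */
  assert(a.str == NULL);
  return release_count;
}
```

demo_releases(0) reports a single release, no matter how often the
free_str()-style cleanup runs. demo_releases(1) reports two releases of
the same pointer, which with the real free() is the kind of thing glibc
aborts on.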

What would be really helpful is to know the value of i in this line:
job_attr_def[i].at_free(&pj->ji_wattr[i]);
(src/resmom/job_func.c:509). Then I would be able to map that back to
the attribute we are trying to free when pbs_mom crashes.

-glen




On Tue, Jul 21, 2009 at 4:12 PM, Oliver Baltzer<obaltzer at flagstonere.bm> wrote:
> Hi Josh,
>
> Josh Butikofer wrote:
>>
>> For the lazy of us, would it be possible for you to just open the core
>> file in
>> GDB and do a quick "where" and send that? If simple enough, we may be
>> able to
>> fix it just from that.
>>
>> gdb pbs_mom -c <COREFILE>
>>  >where
>>
> Here you go:
>
> [root@cyclone mom_priv]# gdb ~obaltzer/build/rpm/BUILD/torque-2.3.7/src/resmom/.libs/pbs_mom core.24232
> #0  0x0000003d0362e21d in raise () from /lib64/tls/libc.so.6
> (gdb) where
> #0  0x0000003d0362e21d in raise () from /lib64/tls/libc.so.6
> #1  0x0000003d0362fa1e in abort () from /lib64/tls/libc.so.6
> #2  0x0000003d03663291 in __libc_message () from /lib64/tls/libc.so.6
> #3  0x0000003d03668eae in _int_free () from /lib64/tls/libc.so.6
> #4  0x0000003d036691f6 in free () from /lib64/tls/libc.so.6
> #5  0x0000000000431915 in free_str (attr=0x5d7d50) at attr_fn_str.c:352
> #6  0x0000000000426814 in job_free (pj=0x5d7ae0) at job_func.c:509
> #7  0x0000000000426b72 in job_purge (pjob=0x5d7ae0) at job_func.c:730
> #8  0x000000000041b696 in req_deletejob (preq=0x5d5f20) at requests.c:1006
> #9  0x00000000004285b5 in process_request (sfds=10) at ../server/process_request.c:651
> #10 0x0000002a9559a2c4 in wait_request (waittime=Variable "waittime" is not available.
> ) at ../Libnet/net_server.c:480
> #11 0x000000000041718f in main_loop () at mom_main.c:8078
> #12 0x000000000041751c in main (argc=1, argv=0x7fbffff648) at mom_main.c:8180
>
> Note, the core dump was from a non-debug build, but it looks like that
> does not make a difference.
>
> Cheers,
> Oliver

