[torquedev] pbs_mom segfault - 2.0.0p7

Garrick Staples garrick at usc.edu
Fri Feb 24 13:10:42 MST 2006


On Mon, Feb 13, 2006 at 06:38:06PM -0700, Marcus R. Epperson alleged:
> Today we saw a rank 0 pbs_mom segfault on a large job.  Here is the 
> backtrace:
> 
> #0  0x000000301132e37d in raise () from /lib64/tls/libc.so.6
> #1  0x000000301132faae in abort () from /lib64/tls/libc.so.6
> #2  0x00000000004069af in catch_abort (sig=11) at mom_main.c:2791
> #3  <signal handler called>
> #4  0x00000030113688c5 in free () from /lib64/tls/libc.so.6
> #5  0x000000000040d0f1 in arrayfree (array=0x96acf0) at mom_comm.c:984
> #6  0x00000000004129d7 in tm_request (fd=11, version=1) at mom_comm.c:4463

The mom_comm code looks correct to me (though I'm disturbed by the use
of assert() to catch failed allocs).
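
For reference, assert() is compiled out when NDEBUG is defined, and otherwise aborts the whole daemon, so neither case handles a failed allocation gracefully. A minimal sketch of an explicit check follows. The helper name alloc_strarray is hypothetical and not part of TORQUE; it only assumes the caller wants a zero-filled, NULL-terminated pointer array.

```c
#include <stdlib.h>

/* Hypothetical helper (not TORQUE code): allocate a zero-filled array with
 * room for `count` string pointers plus one terminating NULL slot.
 * Returns NULL on failure so the caller can reject the request instead of
 * aborting the process. */
static char **alloc_strarray(size_t count)
{
    if (count == (size_t)-1)
        return NULL;                     /* count + 1 would wrap to 0 */

    /* calloc() zero-fills, so every unused slot already reads as NULL */
    return calloc(count + 1, sizeof(char *));
}
```

A caller would test the return value and take its normal error path (for example, jumping to an existing cleanup label) rather than relying on assert().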

It makes me wonder if disrst() scribbled over envp.  We do have a
history of buggy encoding/decoding of "arst" strings, but I thought I
pretty much had that fixed in 2.0.0p5.  Do you still have that job's
.JB file on the server and the MOM?  I'm wondering if the env var list was
broken in some way.

      envp = (char **)calloc(numele, sizeof(char **));

      assert(envp);

      for (i = 0;;i++)
        {
        char *env;

        env = disrst(fd,&ret);

        if ((ret != DIS_SUCCESS) && (ret != DIS_EOD))
          {
          arrayfree(argv);
          arrayfree(envp);    <-- line 4463
          
          goto done;
          }
        ...
        envp[i] = env;
        }

...
  for (i = 0;array[i];i++)
    free(array[i]);    <-- line 984

  free(array);
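
The arrayfree() loop above stops only at a NULL slot, so it trusts that a terminator exists inside the allocation and that every earlier slot holds a pointer free() can accept. As an illustration only, and not a diagnosis of this crash, here is a sketch of a count-based alternative that never walks past the slots it filled. The strvec type and its functions are hypothetical names, not TORQUE code, and this does nothing to protect against slots that were overwritten after being stored.

```c
#include <stdlib.h>

/* Hypothetical container (not TORQUE code): a pointer array that records
 * how many slots are filled instead of relying on a NULL sentinel. */
struct strvec {
    char  **items;
    size_t  len;   /* slots filled so far */
    size_t  cap;   /* slots allocated */
};

/* Allocate room for at least one entry; returns 0 on success, -1 on failure. */
static int strvec_init(struct strvec *v, size_t cap)
{
    if (cap == 0)
        cap = 1;
    v->items = calloc(cap, sizeof(char *));
    v->len = 0;
    v->cap = v->items ? cap : 0;
    return v->items ? 0 : -1;
}

/* Store s and take ownership of it. Returns -1 when the array is full,
 * in which case the caller still owns s. */
static int strvec_push(struct strvec *v, char *s)
{
    if (v->len >= v->cap)
        return -1;
    v->items[v->len++] = s;
    return 0;
}

/* Free exactly the entries that were pushed, then the array itself. */
static void strvec_free(struct strvec *v)
{
    for (size_t i = 0; i < v->len; i++)
        free(v->items[i]);
    free(v->items);
    v->items = NULL;
    v->len = 0;
    v->cap = 0;
}
```

With this shape, a read loop that receives more strings than expected gets -1 from strvec_push() and can bail out, instead of writing past the end of the array.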

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California