[torquedev] superfluous write in start_process()

Ken Nielson knielson at clusterresources.com
Mon Apr 27 15:41:39 MDT 2009


Hi all,

The function start_process() in src/resmom/start_exec.c appears to contain a superfluous write() in the parent branch after the fork. The relevant section of code follows:

  /*
  ** Begin a new process for the fledgling task.
  */

  if ((pid = fork_me(-1)) == -1)
    {
    /* fork failed */

    return(-1);
    }

  if (pid != 0)
    {
    /* parent */

    int gotsuccess = 0;

    close(kid_read);
    close(kid_write);

    /* read sid */

    for (;;)
      {
      i = read(parent_read, (char *) & sjr, sizeof(sjr));

      if ((i == -1) && (errno == EINTR))
        continue;

      if ((i == sizeof(sjr)) && (sjr.sj_code == 0) && !gotsuccess)
        {
        gotsuccess = 1;

        if (write(parent_write, &sjr, sizeof(sjr)) == -1) {}

        continue;
        }

      if (gotsuccess)
        {
        i = sizeof(sjr);
        }

      break;
      }  /* END for(;;) */

    j = errno;

    close(parent_read);

    if (i != sizeof(sjr))
      {
      sprintf(log_buffer, "read of pipe for sid job %s got %d not %ld (errno: %d, %s)",
              pjob->ji_qs.ji_jobid,
              i,
              (long)sizeof(sjr),
              j,
              strerror(j));

      log_err(j, id, log_buffer);

      close(parent_write);

      return(-1);
      }

*************** This is the write that seems to be extra ********************
    if (write(parent_write, &sjr, sizeof(sjr)) == -1) {}

    close(parent_write);


The child process calls starter_return(), which writes a code to the pipe the parent reads on (parent_read) and then waits for an acknowledgment from the parent. Once the acknowledgment is received, the child closes its read end of the pipe (kid_read). The parent then tries to write to that pipe once more. This is a race: if the child reads the acknowledgment and closes the pipe before the parent's second write, a SIGPIPE is sent to the parent and pbs_mom is terminated.
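
To illustrate the failure mode outside of pbs_mom, here is a minimal sketch of what happens when a process writes to a pipe whose read end has already been closed. It is not TORQUE code: write_to_closed_pipe() is a hypothetical helper, and it ignores SIGPIPE only so that the failure shows up as an errno value (EPIPE) rather than killing the process. With the default SIGPIPE disposition the same write() would terminate the caller instead.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of TORQUE): write one byte to a pipe
 * whose read end is already closed.  SIGPIPE is ignored for the
 * duration of the call so the error is reported through errno.
 * Returns 0 if the write succeeds, otherwise the errno value.
 */
int write_to_closed_pipe(void)
{
  int fds[2];
  char ack = 0;
  int err = 0;
  void (*old_handler)(int);

  /* Without this, the write below would raise SIGPIPE and kill us. */
  old_handler = signal(SIGPIPE, SIG_IGN);
  if (old_handler == SIG_ERR)
    return errno;

  if (pipe(fds) == -1) {
    err = errno;
    signal(SIGPIPE, old_handler);
    return err;
  }

  /* The reader goes away, as the child does once it has its ack. */
  close(fds[0]);

  /* Second write with nobody left to read it. */
  if (write(fds[1], &ack, sizeof(ack)) == -1)
    err = errno;

  close(fds[1]);
  signal(SIGPIPE, old_handler);  /* restore the previous disposition */

  return err;
}
```

Calling write_to_closed_pipe() from a small test program should return EPIPE. In start_process() the result of the second write() is discarded, so even with SIGPIPE ignored the parent would have no use for the outcome.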

This write() statement seems to have been in place from the beginning. I was working on the 2.4 branch and was able to reproduce the SIGPIPE every time. I commented out the write() statement and the code worked as expected.

Is there a reason we should not remove this write() statement from the code?

Ken Nielson
Cluster Resources 

