[torquedev] [Bug 96] handle failed stagein jobs properly

bugzilla-daemon at supercluster.org
Fri Nov 5 08:25:56 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96

Simon Toth <SimonT at mail.muni.cz> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |SimonT at mail.muni.cz

--- Comment #1 from Simon Toth <SimonT at mail.muni.cz> 2010-11-05 08:25:56 MDT ---
This is the code that handles this state, and from reading it the behavior
seems reasonable. The job is held and the owner is mailed, so the owner can
either delete the job or release the hold (although I don't know how the
latter is done). Also, jobs appear to stay in this state for only 1800 seconds
(30 minutes).

    if (code != 0)
      {
      /* stage in failed - hold job */

      free_nodes(pjob);

      pwait = &pjob->ji_wattr[(int)JOB_ATR_exectime];

      if ((pwait->at_flags & ATR_VFLAG_SET) == 0)
        {
        pwait->at_val.at_long = time_now + PBS_STAGEFAIL_WAIT;

        pwait->at_flags |= ATR_VFLAG_SET;

        job_set_wait(pwait, pjob, 0);
        }

      svr_setjobstate(pjob, JOB_STATE_WAITING, JOB_SUBSTATE_STAGEFAIL);

      if (preq->rq_reply.brp_choice == BATCH_REPLY_CHOICE_Text)
        {
        /* set job comment */

        /* NYI */

        svr_mailowner(
          pjob,
          MAIL_STAGEIN,
          MAIL_FORCE,
          preq->rq_reply.brp_un.brp_txt.brp_str);
        }
      }
    else
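
To make the timing behavior easier to reason about, here is a minimal
standalone sketch of what the failure branch above appears to do. It is not
TORQUE code: the struct, the enum, the function name, and the
STAGEFAIL_WAIT_SECONDS constant are hypothetical stand-ins, and the 1800
second value is only the figure mentioned above, assumed to match
PBS_STAGEFAIL_WAIT. Node release and the owner mail are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for PBS_STAGEFAIL_WAIT; 1800 s is an assumption
 * taken from the comment above, not read from the TORQUE headers. */
#define STAGEFAIL_WAIT_SECONDS 1800L

/* Hypothetical, simplified job states (not the real JOB_STATE_* values). */
enum sketch_state { SKETCH_QUEUED, SKETCH_WAITING };

/* Hypothetical, simplified job record (not the real TORQUE job struct). */
struct sketch_job {
    bool exectime_set;        /* models ATR_VFLAG_SET on the exectime attribute */
    long exectime;            /* earliest time the job may be considered again */
    enum sketch_state state;  /* models the job state */
    bool stagefail;           /* models the STAGEFAIL substate */
};

/*
 * Models the stage-in reply handling: on a non-zero reply code the job
 * gets an execution time of now + wait, unless one was already set, and
 * is moved to the waiting state. Returns true if the failure path ran.
 */
bool sketch_on_stagein_reply(struct sketch_job *job, int code, long now)
{
    if (code == 0)
        return false;  /* success path is not modeled here */

    if (!job->exectime_set) {
        /* Only fill in a wait time when the job has none of its own. */
        job->exectime = now + STAGEFAIL_WAIT_SECONDS;
        job->exectime_set = true;
    }

    job->state = SKETCH_WAITING;
    job->stagefail = true;
    return true;
}
```

In this model, a job with no execution time that fails stage-in at time 1000
ends up waiting until 2800. A job that already had an execution time keeps
it, which mirrors the ATR_VFLAG_SET check in the excerpt: the 30 minute delay
is only applied when nothing else was set.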
