Bug 96 - handle failed stagein jobs properly
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 2.4.x
Platform: PC Linux
Importance: P5 enhancement
Assigned To: Glen
Reported: 2010-11-05 07:45 MDT by ramon.bastiaans
Modified: 2010-11-17 04:27 MST (History)

Description ramon.bastiaans 2010-11-05 07:45:02 MDT
Whenever a job uses stagein and staging in the files fails (for example because
the source file does not exist), the job gets stuck in Waiting status. The job
already has a CPU/node/slot assigned but never starts.

The job stays in Waiting status forever, clogging the scheduling.

Here is how it shows in the pbs_server log:

11/03/2010 11:35:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 8056228.ce.XXX state from TRANSIT-TRANSICM to 
QUEUED-PRESTAGEIN (1-11)
11/03/2010 11:35:22;0100;PBS_Server;Job;8056228.ce.XXX;enqueuing into 
infra, state 1 hop 1
11/03/2010 11:35:22;0008;PBS_Server;Job;8056228.ce.XXX;Job Queued at 
request of ramonb@ce.XXX, owner = ramonb@ce.XXX, job name = 
testjob.sh, queue = infra
11/03/2010 11:35:30;0008;PBS_Server;Job;8056228.ce.XXX;Job Run at 
request of root@ce.XXX
11/03/2010 11:35:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 8056228.ce.XXX state from QUEUED-PRESTAGEIN to 
RUNNING-STAGEGO (4-15)
11/03/2010 11:35:35;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 8056228.ce.XXX state from RUNNING-STAGEGO to 
WAITING-STAGEFAIL (3-37)

The job never comes out of this WAITING-STAGEFAIL status anymore.
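
For anyone who wants to spot these jobs from the server log, here is a minimal C sketch that picks the job id and the old and new state out of a svr_setjobstate record. It assumes each record sits on a single line in the real log file (the excerpt above is wrapped by the bug tracker). The names parse_transition and is_stagefail are made up for this example and are not TORQUE functions.

```c
#include <stdio.h>
#include <string.h>

/* One parsed svr_setjobstate transition (hypothetical helper type). */
struct transition {
    char job[64];
    char from[48];
    char to[48];
};

/* Parse a log record containing
 *   "svr_setjobstate: setting job <id> state from <A> to <B> (n-m)".
 * Returns 1 on success, 0 if the line is not a state transition. */
int parse_transition(const char *line, struct transition *t)
{
    const char *marker = "svr_setjobstate: setting job ";
    const char *p = strstr(line, marker);

    if (p == NULL)
        return 0;

    p += strlen(marker);

    /* Field widths keep sscanf from overflowing the buffers. */
    return sscanf(p, "%63s state from %47s to %47s",
                  t->job, t->from, t->to) == 3;
}

/* True when the transition moves the job into the failed stage-in wait. */
int is_stagefail(const struct transition *t)
{
    return strcmp(t->to, "WAITING-STAGEFAIL") == 0;
}
```

Feeding it the last record of the excerpt above would yield job 8056228.ce.XXX going from RUNNING-STAGEGO to WAITING-STAGEFAIL.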

In qstat these jobs end up looking like this:

 7921738.ce.     alisgm05 medium   crm01_325857094     --      1   1    --  36:00 W   --
    am94-05/1

Where am94-05 is the hostname of the compute node assigned to it.

Another related problem is that the prologue is only executed _after_ staging
has completed, which prevents any stage testing from being done in the prologue.

As a side effect, while jobs are stuck in the WAITING-STAGEFAIL status, they
prevent new jobs from being scheduled. Once the number of stuck waiting jobs
reaches maui's reservationdepth, no new reservations can be created and
the entire scheduling is stuck in a deadlock.

Please implement a proper way of handling failed stage jobs, i.e. an option to
either:
 * automatically delete/discard failed stage jobs
 * automatically hold failed stage jobs
 * and/or possibly run prologue _before_ staging takes place, not after
 * etc
Comment 1 Simon Toth 2010-11-05 08:25:56 MDT
This is the code handling this state. From looking at it, it seems pretty
reasonable. The job is held and the owner is mailed, so they can either delete
or unhold the job (although I don't know how that can be done). Also, jobs seem
to be held in this state for only 1800 seconds (30 minutes).

    if (code != 0)
      {
      /* stage in failed - hold job */

      free_nodes(pjob);

      pwait = &pjob->ji_wattr[(int)JOB_ATR_exectime];

      if ((pwait->at_flags & ATR_VFLAG_SET) == 0)
        {
        pwait->at_val.at_long = time_now + PBS_STAGEFAIL_WAIT;

        pwait->at_flags |= ATR_VFLAG_SET;

        job_set_wait(pwait, pjob, 0);
        }

      svr_setjobstate(pjob, JOB_STATE_WAITING, JOB_SUBSTATE_STAGEFAIL);

      if (preq->rq_reply.brp_choice == BATCH_REPLY_CHOICE_Text)
        {
        /* set job comment */

        /* NYI */

        svr_mailowner(
          pjob,
          MAIL_STAGEIN,
          MAIL_FORCE,
          preq->rq_reply.brp_un.brp_txt.brp_str);
        }
      }
    else
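
To make the timing in the excerpt easier to follow, here is a stripped-down sketch of just the wait computation. It is a simplified model, not the server code: struct exec_time_attr and stagefail_release_time are hypothetical stand-ins for the JOB_ATR_exectime attribute handling, and the 1800 second value is the one quoted in this comment for PBS_STAGEFAIL_WAIT.

```c
#include <time.h>

/* Wait quoted in this thread; the real constant is PBS_STAGEFAIL_WAIT. */
#define STAGEFAIL_WAIT_SECONDS 1800

/* Hypothetical stand-in for the job's execution-time attribute. */
struct exec_time_attr {
    int    is_set;  /* mirrors the ATR_VFLAG_SET test in the excerpt */
    time_t value;   /* earliest time the job may be run again */
};

/* Push the earliest start time ahead by the stage-fail wait, but only
 * when no execution time is set yet, as the excerpt does.
 * Returns the time at which the job leaves the waiting state. */
time_t stagefail_release_time(struct exec_time_attr *attr, time_t now)
{
    if (!attr->is_set) {
        attr->value  = now + STAGEFAIL_WAIT_SECONDS;
        attr->is_set = 1;
    }

    return attr->value;
}
```

Note the consequence of the flag test: if the attribute is already set when stage-in fails, the sketch (like the excerpt) leaves the existing value untouched rather than extending it.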
Comment 2 ramon.bastiaans 2010-11-05 08:43:09 MDT
Well, for one, we have mailing disabled in Torque using:
qmgr -c 'set server mail_domain = never'

So that email will never get sent to the end user. Furthermore, our end users
cannot log in to our system; we have no interactive (shell) access machine. So
they would never be able to retrieve that email.

Why won't the job enter the "H" held/hold state if it's working as designed,
instead of staying in the "W"aiting status? Seems to me the node assignment
should be released if this happens, and right now it is not (probably because
it stays in Waiting status?).

Secondly, why hold it 30 minutes? A lot can happen in 30 minutes.

It would be nice to make that policy configurable. As I said, to either:
 * delete/discard
 * hold (really hold, not stay in waiting)
 * hold for a configurable amount of time

Perhaps something like:

set server stagefail_policy = <hold|discard>
set server stagefail_holdtimeout = 90
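
As a rough illustration of how such settings could drive the server's decision, here is a minimal C sketch. Everything in it is hypothetical: neither stagefail_policy nor stagefail_holdtimeout exists in TORQUE, the type and function names are invented for this example, and treating the timeout as seconds is an assumption since the proposal does not name a unit.

```c
#include <string.h>
#include <time.h>

/* Hypothetical values of the proposed stagefail_policy setting. */
enum stagefail_policy { STAGEFAIL_HOLD, STAGEFAIL_DISCARD };

/* Hypothetical in-memory form of the two proposed settings. */
struct stagefail_config {
    enum stagefail_policy policy;
    long hold_timeout;  /* stagefail_holdtimeout; seconds is an assumption */
};

/* Map the setting's string value to the enum.
 * Returns 0 on success, -1 for an unknown value. */
int parse_stagefail_policy(const char *value, enum stagefail_policy *out)
{
    if (strcmp(value, "hold") == 0) {
        *out = STAGEFAIL_HOLD;
        return 0;
    }
    if (strcmp(value, "discard") == 0) {
        *out = STAGEFAIL_DISCARD;
        return 0;
    }
    return -1;
}

/* Decide what to do with a job whose stage-in failed.
 * Returns 1 if the job should be deleted. Returns 0 if it should be held,
 * in which case *release_at receives the time the hold ends. */
int stagefail_should_discard(const struct stagefail_config *cfg,
                             time_t now, time_t *release_at)
{
    if (cfg->policy == STAGEFAIL_DISCARD)
        return 1;

    *release_at = now + cfg->hold_timeout;
    return 0;
}
```

With policy "hold" and a timeout of 90, a failure at time t would give a release time of t + 90; with "discard" the release time is never consulted.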
Comment 3 Simon Toth 2010-11-05 18:38:33 MDT
(In reply to comment #2)
> Why won't the job enter the "H" held/hold state if it's working as designed,
> instead of staying in the "W"aiting status? Seems to me the node assignment
> should be released if this happens, and right now it is not (probably because
> it stays in Waiting status?).

As far as I know, held status is only for queued jobs. It is used to determine
if a job is ready to be run. This is a different state.

Looking at the code, the jobs should be released back into queued once the
timeout expires.

> Secondly, why hold it 30 minutes? A lot can happen in 30 minutes.

I don't know, it's just a preset (I consider it quite reasonable). It's a
compile-time constant; you can easily change it if you don't like it.

> It would be nice to make that policy configurable. As I said, to either:
>  * delete/discard
>  * hold (really hold, not stay in waiting)
>  * hold for a configurable amount of time
> 
> Perhaps something like:
> 
> set server stagefail_policy = <hold|discard>
> set server stagefail_holdtimeout = 90

Yeah, configurable is always better.
Comment 4 ramon.bastiaans 2010-11-17 04:27:17 MST
These jobs keep bouncing around in my batch.

As you can see, this job has been stuck for 3 days now:

12:24 ce.my.fqdn.xxx:/var/tmp/wax 
root# grep 8318989 /var/spool/torque/server_logs/20101116
11/16/2010 09:34:33;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into
medium, state 1 hop 1
11/16/2010 09:34:33;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job,
substate: 11 Requeued in queue: medium
11/16/2010 11:08:54;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into
medium, state 1 hop 1
11/16/2010 11:08:54;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job,
substate: 11 Requeued in queue: medium
11/16/2010 16:56:02;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into
medium, state 1 hop 1
11/16/2010 16:56:02;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job,
substate: 11 Requeued in queue: medium
11/16/2010 18:25:16;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 18:25:16;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 18:25:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 18:55:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 18:55:53;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 18:55:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 18:55:57;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 19:25:57;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 19:26:28;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 19:26:28;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 19:26:32;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 19:56:32;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 19:56:49;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 19:56:49;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 19:56:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 20:26:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 20:27:14;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 20:27:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 20:27:19;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 20:57:19;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 20:57:47;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 20:57:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 20:57:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 21:27:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 21:28:50;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 21:28:50;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 21:28:55;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 21:58:55;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 21:59:37;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 21:59:37;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 21:59:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 22:29:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 22:30:09;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 22:30:09;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 22:30:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 23:00:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 23:00:37;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 23:00:37;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 23:00:42;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 23:30:42;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 23:31:16;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at
request of root@ce.my.fqdn.xxx
11/16/2010 23:31:16;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 23:31:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job
8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)

12:24 ce.my.fqdn.xxx:/var/tmp/wax 
root# showq | grep Hold
8318989              lhb038       Hold     1  1:12:00:00  Sun Nov 14 20:12:11

12:25 ce.my.fqdn.xxx:/var/tmp/wax 
root# qstat -ns 8318989

ce.my.fqdn.xxx: 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
8318989.ce.gina.     lhb038   medium   crm01_533024086     --      1   1    --  36:00 W   --
   v35-13/5
    --

12:25 ce.my.fqdn.xxx:/var/tmp/wax 
root#
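
The log in this comment shows the job entering WAITING-STAGEFAIL eleven times on 11/16/2010, each time returning to QUEUED-PRESTAGEIN exactly 30 minutes later (for example 18:25:52 to 18:55:52). A minimal C sketch for measuring that from a log file follows. It assumes one record per line and that both timestamps fall on the same day; log_seconds_of_day and count_stagefail are illustrative names, not part of TORQUE.

```c
#include <stdio.h>
#include <string.h>

/* Seconds since midnight, taken from a record that starts with
 * "MM/DD/YYYY HH:MM:SS;". Returns -1 if the prefix does not match. */
long log_seconds_of_day(const char *line)
{
    int mon, day, year, h, m, s;

    if (sscanf(line, "%d/%d/%d %d:%d:%d",
               &mon, &day, &year, &h, &m, &s) != 6)
        return -1;

    return h * 3600L + m * 60L + s;
}

/* Count the records that move a job into WAITING-STAGEFAIL. */
int count_stagefail(const char *const lines[], int n)
{
    int i;
    int count = 0;

    for (i = 0; i < n; i++) {
        if (strstr(lines[i], "to WAITING-STAGEFAIL") != NULL)
            count++;
    }

    return count;
}
```

Subtracting log_seconds_of_day for the record that enters WAITING-STAGEFAIL from the one that leaves it gives the wait between retries, which for the first cycle in the log above comes to 1800 seconds.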