Bugzilla – Bug 96
handle failed stagein jobs properly
Last modified: 2010-11-17 04:27:17 MST
Whenever a job uses stagein and staging in of files fails (for example because the source file does not exist), the job gets stuck in Waiting status. The job already has a CPU/node/slot assigned but never starts. It stays in Waiting status forever, clogging the scheduling.

Here is how it shows in the pbs_server log:

11/03/2010 11:35:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from TRANSIT-TRANSICM to QUEUED-PRESTAGEIN (1-11)
11/03/2010 11:35:22;0100;PBS_Server;Job;8056228.ce.XXX;enqueuing into infra, state 1 hop 1
11/03/2010 11:35:22;0008;PBS_Server;Job;8056228.ce.XXX;Job Queued at request of ramonb@ce.XXX, owner = ramonb@ce.XXX, job name = testjob.sh, queue = infra
11/03/2010 11:35:30;0008;PBS_Server;Job;8056228.ce.XXX;Job Run at request of root@ce.XXX
11/03/2010 11:35:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/03/2010 11:35:35;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)

The job never comes out of this WAITING-STAGEFAIL status.

In qstat these jobs end up looking like this:

7921738.ce. alisgm05 medium crm01_325857094 -- 1 1 -- 36:00 W -- am94-05/1

where am94-05 is the hostname of the compute node assigned to the job.

Another related problem is that the prologue is only executed _after_ staging has completed, which prevents any stage testing from being done in the prologue.

As a side effect, jobs stuck in the WAITING-STAGEFAIL status prevent new jobs from being scheduled. Once the number of stuck waiting jobs reaches maui's reservationdepth, no new reservations can be created and the entire scheduling is stuck in a deadlock.

Please implement a proper way of handling failed stage jobs, i.e. an option to either:
* automatically delete/discard failed stage jobs
* automatically hold failed stage jobs
* and/or possibly run prologue _before_ staging takes place, not after
* etc
This is the code handling this state. From looking at it, it seems pretty reasonable. The job is held and the owner is mailed, so he will either delete or unhold the job (although I don't know how that can be done). Plus, jobs seem to be held in this state for only 1800 seconds (30 minutes).

  if (code != 0)
    {
    /* stage in failed - hold job */

    free_nodes(pjob);

    pwait = &pjob->ji_wattr[(int)JOB_ATR_exectime];

    if ((pwait->at_flags & ATR_VFLAG_SET) == 0)
      {
      pwait->at_val.at_long = time_now + PBS_STAGEFAIL_WAIT;
      pwait->at_flags |= ATR_VFLAG_SET;

      job_set_wait(pwait, pjob, 0);
      }

    svr_setjobstate(pjob, JOB_STATE_WAITING, JOB_SUBSTATE_STAGEFAIL);

    if (preq->rq_reply.brp_choice == BATCH_REPLY_CHOICE_Text)
      {
      /* set job comment */
      /* NYI */

      svr_mailowner(
        pjob,
        MAIL_STAGEIN,
        MAIL_FORCE,
        preq->rq_reply.brp_un.brp_txt.brp_str);
      }
    }
  else
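To make the timing in that excerpt easier to follow, here is a minimal standalone model of the exectime branch. It is only a sketch: `struct wait_attr`, `ATTR_FLAG_SET`, `STAGEFAIL_WAIT_SECONDS` and `stagefail_release_time` are hypothetical stand-ins, not Torque definitions, and the 1800 second value is simply the figure quoted in the comment above. The behaviour it models appears consistent with the server log posted later in this bug, where a job enters WAITING-STAGEFAIL at 18:25:52 and is requeued at 18:55:52.

```c
/* Assumption: 1800 s, the wait quoted in the comment above. */
#define STAGEFAIL_WAIT_SECONDS 1800L

/* Hypothetical stand-in for ATR_VFLAG_SET. */
#define ATTR_FLAG_SET 0x01u

/* Simplified stand-in for the job's exectime attribute. */
struct wait_attr {
    long value;         /* epoch seconds at which the job may run again */
    unsigned int flags; /* ATTR_FLAG_SET when value is meaningful */
};

/*
 * Models the branch in the excerpt: only when no execution time is set
 * yet, schedule the retry STAGEFAIL_WAIT_SECONDS after `now`.
 * Returns the time at which the job would leave the waiting state.
 */
long stagefail_release_time(struct wait_attr *exectime, long now)
{
    if ((exectime->flags & ATTR_FLAG_SET) == 0) {
        exectime->value = now + STAGEFAIL_WAIT_SECONDS;
        exectime->flags |= ATTR_FLAG_SET;
    }
    return exectime->value;
}
```

Note that, under this reading, a job that already carries an execution time keeps it, so the 30 minute wait only applies when no exectime was set beforehand.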
Well, for one, we have mailing disabled in Torque using:

qmgr -c 'set server mail_domain = never'

So that email will never get sent to the end user. Furthermore, our end users cannot log in to our system; we have no interactive (shell) access machine. So they would never be able to retrieve that email anyway.

Why doesn't the job enter the "H" (held) state if it's working as designed, instead of staying in the "W" (waiting) state? It seems to me the node assignment should be released if this happens, and right now it is not (probably because the job stays in Waiting status?).

Secondly, why hold it for 30 minutes? A lot can happen in 30 minutes. It would be nice to make that policy configurable. Like I said, to either:
* delete/discard
* hold (really hold, not stay in waiting)
* hold for a configurable amount of time

Perhaps something like:

set server stagefail_policy = <hold|discard>
set server stagefail_holdtimeout = 90
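As a rough illustration of what the proposed settings could mean on the server side, here is a small standalone sketch. Everything in it is hypothetical: neither `stagefail_policy` nor `stagefail_holdtimeout` exists in Torque as far as this bug shows, the type and function names are invented for the example, and the unit of the timeout is assumed to be seconds because the proposal does not say.

```c
#include <string.h>

/* Hypothetical actions matching the proposed <hold|discard> values. */
enum stagefail_action {
    STAGEFAIL_HOLD,
    STAGEFAIL_DISCARD
};

/* Hypothetical server-side view of the two proposed attributes. */
struct stagefail_config {
    enum stagefail_action action;
    long hold_timeout; /* assumed seconds; 0 means hold indefinitely */
};

/*
 * Parse the value of the proposed stagefail_policy attribute.
 * Returns 0 on success and -1 for an unknown or missing value.
 */
int parse_stagefail_policy(const char *value, enum stagefail_action *out)
{
    if (value == NULL || out == NULL)
        return -1;
    if (strcmp(value, "hold") == 0) {
        *out = STAGEFAIL_HOLD;
        return 0;
    }
    if (strcmp(value, "discard") == 0) {
        *out = STAGEFAIL_DISCARD;
        return 0;
    }
    return -1;
}

/*
 * Returns the time at which a failed job would be released again,
 * or -1 if it never comes back on its own (discard, or hold with
 * no timeout).
 */
long stagefail_release_after(const struct stagefail_config *cfg, long now)
{
    if (cfg->action == STAGEFAIL_HOLD && cfg->hold_timeout > 0)
        return now + cfg->hold_timeout;
    return -1;
}
```

With `stagefail_policy = hold` and `stagefail_holdtimeout = 90`, this model would release the job 90 time units after the failure; with `discard` the job would not return at all.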
(In reply to comment #2)
> Why doesn't the job enter the "H" (held) state if it's working as designed,
> instead of staying in the "W" (waiting) state? It seems to me the node
> assignment should be released if this happens, and right now it is not
> (probably because the job stays in Waiting status?).

As far as I know, held status is only for queued jobs. It is used to determine if a job is ready to be run. This is a different state. Looking at the code, the jobs should be released back into queued once the timeout expires.

> Secondly, why hold it for 30 minutes? A lot can happen in 30 minutes.

I don't know, it's just a preset (I consider it quite reasonable). It's a compile-time constant; you can easily change it if you don't like it.

> It would be nice to make that policy configurable. Like I said, to either:
> * delete/discard
> * hold (really hold, not stay in waiting)
> * hold for a configurable amount of time
>
> Perhaps something like:
>
> set server stagefail_policy = <hold|discard>
> set server stagefail_holdtimeout = 90

Yeah, configurable is always better.
These jobs keep bouncing around in my batch system. As you can see, this job has been stuck for 3 days now:

12:24 ce.my.fqdn.xxx:/var/tmp/wax root# grep 8318989 /var/spool/torque/server_logs/20101116
11/16/2010 09:34:33;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into medium, state 1 hop 1
11/16/2010 09:34:33;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job, substate: 11 Requeued in queue: medium
11/16/2010 11:08:54;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into medium, state 1 hop 1
11/16/2010 11:08:54;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job, substate: 11 Requeued in queue: medium
11/16/2010 16:56:02;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into medium, state 1 hop 1
11/16/2010 16:56:02;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job, substate: 11 Requeued in queue: medium
11/16/2010 18:25:16;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 18:25:16;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 18:25:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 18:55:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 18:55:53;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 18:55:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 18:55:57;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 19:25:57;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 19:26:28;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 19:26:28;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 19:26:32;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 19:56:32;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 19:56:49;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 19:56:49;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 19:56:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 20:26:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 20:27:14;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 20:27:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 20:27:19;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 20:57:19;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 20:57:47;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 20:57:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 20:57:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 21:27:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 21:28:50;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 21:28:50;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 21:28:55;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 21:58:55;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 21:59:37;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 21:59:37;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 21:59:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 22:29:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 22:30:09;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 22:30:09;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 22:30:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 23:00:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 23:00:37;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 23:00:37;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 23:00:42;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 23:30:42;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 23:31:16;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root@ce.my.fqdn.xxx
11/16/2010 23:31:16;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 23:31:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)

12:24 ce.my.fqdn.xxx:/var/tmp/wax root# showq | grep Hold
8318989 lhb038 Hold 1 1:12:00:00 Sun Nov 14 20:12:11

12:25 ce.my.fqdn.xxx:/var/tmp/wax root# qstat -ns 8318989

ce.my.fqdn.xxx:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
8318989.ce.gina.     lhb038   medium   crm01_533024086      --     1   1     -- 36:00 W    --
   v35-13/5
   --
12:25 ce.my.fqdn.xxx:/var/tmp/wax root#
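For anyone who wants to quantify this bouncing across many jobs, the following is a minimal sketch of a counter for the stage-in failures of one job in a server log. It assumes the log is available as one NUL-terminated buffer with one entry per line, as in the server_logs file itself, and that failures are recorded in the `svr_setjobstate ... to WAITING-STAGEFAIL` form shown above. The function names are invented for this example and are not part of Torque.

```c
#include <string.h>

/* Returns 1 if `needle` occurs within the first `len` bytes of `line`. */
static int line_contains(const char *line, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    size_t i;

    if (n == 0)
        return 1;
    for (i = 0; i + n <= len; i++) {
        if (memcmp(line + i, needle, n) == 0)
            return 1;
    }
    return 0;
}

/*
 * Count how many times `job_id` entered WAITING-STAGEFAIL in `log`.
 * Matching is by plain substring, so pass the full job id
 * (for example "8318989.ce.my.fqdn.xxx") to limit false matches.
 */
int count_stagefail_cycles(const char *log, const char *job_id)
{
    int count = 0;
    const char *line = log;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = (end != NULL) ? (size_t)(end - line) : strlen(line);

        if (line_contains(line, len, "svr_setjobstate") &&
            line_contains(line, len, job_id) &&
            line_contains(line, len, "to WAITING-STAGEFAIL"))
            count++;

        if (end == NULL)
            break;
        line = end + 1;
    }
    return count;
}
```

The transition back out of the state reads "from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN", which does not contain the string "to WAITING-STAGEFAIL", so only entries into the failed state are counted.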