[torquedev] [Bug 96] New: handle failed stagein jobs properly
bugzilla-daemon at supercluster.org
Fri Nov 5 07:45:03 MDT 2010
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96
Summary: handle failed stagein jobs properly
Product: TORQUE
Version: 2.4.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P5
Component: pbs_server
AssignedTo: glen.beane at gmail.com
ReportedBy: ramon.bastiaans at sara.nl
CC: torquedev at supercluster.org
Estimated Hours: 0.0
Whenever a job uses stagein and staging in of files fails (for example because
the source file does not exist), the job gets stuck in Waiting status. The job
already has a CPU/node/slot assigned but never starts.
It stays in Waiting status forever, clogging the scheduler.
Here is how it shows in the pbs_server log:
11/03/2010 11:35:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from TRANSIT-TRANSICM to QUEUED-PRESTAGEIN (1-11)
11/03/2010 11:35:22;0100;PBS_Server;Job;8056228.ce.XXX;enqueuing into infra, state 1 hop 1
11/03/2010 11:35:22;0008;PBS_Server;Job;8056228.ce.XXX;Job Queued at request of ramonb at ce.XXX, owner = ramonb at ce.XXX, job name = testjob.sh, queue = infra
11/03/2010 11:35:30;0008;PBS_Server;Job;8056228.ce.XXX;Job Run at request of root at ce.XXX
11/03/2010 11:35:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/03/2010 11:35:35;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
The job never leaves this WAITING-STAGEFAIL status.
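Until the server handles this itself, affected jobs can be found by scanning the pbs_server log for the state change shown above. The following is only a minimal sketch: the function name stuck_stagein_jobs is made up for illustration, and it assumes that each log entry sits on a single line in the actual log file (the excerpt above is wrapped by the mailer) and that the svr_setjobstate message has exactly the wording shown in the excerpt.

```python
import re

# Matches the svr_setjobstate message as worded in the log excerpt.
# Assumption: the message format is the same for all state changes.
STATE_CHANGE = re.compile(
    r"svr_setjobstate:\s+setting job (?P<job>\S+) state from "
    r"(?P<old>\S+) to (?P<new>\S+)"
)


def stuck_stagein_jobs(log_lines):
    """Return ids of jobs whose last logged state is WAITING-STAGEFAIL."""
    last_state = {}
    for line in log_lines:
        match = STATE_CHANGE.search(line)
        if match:
            # Later entries overwrite earlier ones, so only the most
            # recent state of each job is kept.
            last_state[match.group("job")] = match.group("new")
    return sorted(
        job for job, state in last_state.items()
        if state == "WAITING-STAGEFAIL"
    )
```

Tracking the last state per job, rather than matching any line that mentions WAITING-STAGEFAIL, avoids reporting a job that was later moved out of that state by an administrator.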
In qstat these jobs end up looking like this:
7921738.ce. alisgm05 medium crm01_325857094 -- 1 1 -- 36:00 W --
am94-05/1
Where am94-05 is the hostname of the compute node assigned to it.
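The same jobs can also be picked out of the qstat listing. This is a rough sketch that assumes the 11-column job line shown above, with the state in the tenth column, and that the job line is not wrapped in the real output; waiting_jobs is a hypothetical helper name. Note that state W by itself does not prove a stagein failure, since jobs can be in Waiting status for other reasons, so the result is a list of candidates only.

```python
def waiting_jobs(qstat_lines):
    """Return job ids of lines whose state column is W.

    Assumes 11 whitespace-separated columns per job line, as in the
    sample above; node lines such as "am94-05/1" have fewer fields
    and are skipped.
    """
    jobs = []
    for line in qstat_lines:
        fields = line.split()
        if len(fields) == 11 and fields[9] == "W":
            jobs.append(fields[0])
    return jobs
```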
A related problem is that the prologue is only executed _after_ staging has
completed, so the prologue cannot be used to test whether staging will succeed.
As a side effect, jobs stuck in the WAITING-STAGEFAIL status prevent new jobs
from being scheduled. Once the number of stuck waiting jobs reaches maui's
reservationdepth, no new reservations can be created and the entire scheduling
is stuck in a deadlock.
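The condition described above can be written down as a toy model. It takes the description in this report at face value (each stuck job occupies one reservation, and the number of reservations is capped by reservationdepth); it is not maui's actual algorithm, and both function names are invented for illustration.

```python
def free_reservation_slots(stuck_jobs, reservation_depth):
    """Reservations still available for new jobs in this toy model."""
    return max(reservation_depth - stuck_jobs, 0)


def is_deadlocked(stuck_jobs, reservation_depth):
    """True once the stuck jobs occupy every available reservation."""
    return free_reservation_slots(stuck_jobs, reservation_depth) == 0
```

Under this model, a single failing stagein per day is enough to stall a cluster eventually, because nothing ever releases the reservations held by the stuck jobs.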
Please implement a proper way of handling jobs whose stagein has failed, i.e.
an option to either:
* automatically delete/discard failed stagein jobs
* automatically hold failed stagein jobs
* and/or possibly run prologue _before_ staging takes place, not after
* etc
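As an interim workaround outside pbs_server, the first two options could be approximated by a script that runs the standard qdel or qhold commands on the stuck jobs. This is a sketch only: the policy names "delete" and "hold" are made up here and do not correspond to any existing TORQUE setting, and whether qhold has any effect on a job that is already in WAITING-STAGEFAIL is untested.

```python
import subprocess

# Hypothetical policy names mapped to the standard TORQUE client commands.
POLICY_COMMANDS = {
    "delete": "qdel",
    "hold": "qhold",
}


def command_for(policy, job_id):
    """Build the command line for one job under the given policy."""
    try:
        return [POLICY_COMMANDS[policy], job_id]
    except KeyError:
        raise ValueError(f"unknown stage-failure policy: {policy!r}") from None


def apply_policy(policy, job_ids, dry_run=True):
    """Return the commands for all jobs; run them unless dry_run is set."""
    commands = [command_for(policy, job_id) for job_id in job_ids]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

The dry_run default means the script only reports what it would do, which seems the safer choice for something that deletes user jobs.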