[torquedev] [Bug 96] New: handle failed stagein jobs properly

bugzilla-daemon@supercluster.org
Fri Nov 5 07:45:03 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96

           Summary: handle failed stagein jobs properly
           Product: TORQUE
           Version: 2.4.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane@gmail.com
        ReportedBy: ramon.bastiaans@sara.nl
                CC: torquedev@supercluster.org
   Estimated Hours: 0.0


Whenever a job uses stagein and the staging in of its files fails (for example
because the source file does not exist), the job gets stuck in the Waiting
state. The job already has a CPU/node/slot assigned, but it will never start.

The job stays in the Waiting state forever, clogging up scheduling.
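
This is easy to reproduce with any nonexistent source file; the hostname and
paths below are made up:

  qsub -W stagein=input.dat@ce.XXX:/nonexistent/input.dat testjob.sh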

Here is how it appears in the pbs_server log:

11/03/2010 11:35:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from TRANSIT-TRANSICM to QUEUED-PRESTAGEIN (1-11)
11/03/2010 11:35:22;0100;PBS_Server;Job;8056228.ce.XXX;enqueuing into infra, state 1 hop 1
11/03/2010 11:35:22;0008;PBS_Server;Job;8056228.ce.XXX;Job Queued at request of ramonb@ce.XXX, owner = ramonb@ce.XXX, job name = testjob.sh, queue = infra
11/03/2010 11:35:30;0008;PBS_Server;Job;8056228.ce.XXX;Job Run at request of root@ce.XXX
11/03/2010 11:35:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/03/2010 11:35:35;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8056228.ce.XXX state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)

The job never comes out of this WAITING-STAGEFAIL state (3-37) again.

In qstat these jobs end up looking like this:

 7921738.ce.     alisgm05 medium   crm01_325857094     --      1   1   --  36:00 W   --
    am94-05/1

Here am94-05 is the hostname of the compute node assigned to the job.

Another, related problem is that the prologue is only executed _after_ staging
has completed, which makes it impossible to do any staging checks from the
prologue.

As a side effect, jobs stuck in the WAITING-STAGEFAIL state prevent new jobs
from being scheduled: every stuck job holds a reservation, so once the number
of stuck waiting jobs reaches Maui's RESERVATIONDEPTH, no new reservations can
be created and scheduling deadlocks entirely.
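
For reference, the Maui parameters in question; the values below are
illustrative ones from a hypothetical maui.cfg:

  # maui.cfg (illustrative values)
  RESERVATIONPOLICY CURRENTHIGHEST
  RESERVATIONDEPTH  4

With RESERVATIONDEPTH 4, four stuck WAITING-STAGEFAIL jobs are enough to block
all further reservations.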

Please implement a proper way of handling jobs whose stagein failed, i.e. an
option to:
 * automatically delete/discard jobs whose stagein failed
 * automatically hold jobs whose stagein failed
 * and/or possibly run the prologue _before_ staging takes place, not after
 * etc
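
In the meantime we run a small external watchdog as a stopgap. The sketch
below is illustrative only, not a proposed pbs_server patch: it assumes qstat
and qdel are in PATH, and that any job which "qstat -f" reports with
job_state = W and substate = 37 (the 3-37 codes from the log above) is a
failed-stagein job that is safe to delete.

/* stagefail-reaper.c - stopgap watchdog, illustrative only.
 * Deletes jobs that "qstat -f" reports in job_state W with
 * substate 37 (WAITING-STAGEFAIL, the 3-37 codes in the log).
 * Assumes qstat/qdel are in PATH; run periodically, e.g. from cron.
 * Build: cc -o stagefail-reaper stagefail-reaper.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* delete the job if its qstat -f record showed both markers */
static void reap(const char *jobid, int waiting, int stagefail)
{
    char cmd[300];

    if (jobid[0] == '\0' || !waiting || !stagefail)
        return;
    snprintf(cmd, sizeof(cmd), "qdel '%s'", jobid);
    printf("deleting stuck stagein job %s\n", jobid);
    system(cmd);
}

int main(void)
{
    FILE *fp = popen("qstat -f", "r");
    char line[4096];
    char jobid[256] = "";
    int waiting = 0, stagefail = 0;

    if (fp == NULL) {
        perror("popen: qstat -f");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, "Job Id:", 7) == 0) {
            reap(jobid, waiting, stagefail);   /* close previous record */
            waiting = stagefail = 0;
            /* "Job Id: 8056228.ce.XXX" -> keep the id, strip newline */
            const char *id = line + 7;
            while (*id == ' ')
                id++;
            snprintf(jobid, sizeof(jobid), "%s", id);
            jobid[strcspn(jobid, "\r\n")] = '\0';
        } else if (strstr(line, "job_state = W") != NULL) {
            waiting = 1;
        } else if (strstr(line, "substate = 37") != NULL) {
            /* naive substring match; good enough for a sketch */
            stagefail = 1;
        }
    }
    reap(jobid, waiting, stagefail);           /* last record */
    pclose(fp);
    return 0;
}

Running this from cron every few minutes keeps the waiting queue from filling
up, but it is obviously a workaround, not a fix.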


