[torquedev] [Bug 185] New: Can't delete job arrays with dead jobs, such arrays should never be loaded

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Wed Apr 25 07:42:54 MDT 2012


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=185

           Summary: Can't delete job arrays with dead jobs, such arrays
                    should never be loaded
           Product: TORQUE
           Version: 3.0.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: rhys.hill at adelaide.edu.au
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


Created an attachment (id=105)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=105)
An example of a job array which cannot be deleted, due to a corrupt/phantom
job.

Job arrays can become corrupt in such a way that they reference jobs that no
longer exist. Such an array is impossible to delete, because job arrays are
currently deleted via reference-counted garbage collection as the jobs in the
array are purged. An attempt to delete such an array currently reports no
errors, but the array is not removed.
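
To illustrate the failure mode, here is a minimal sketch of reference-counted
array cleanup. It is a simplified model, not TORQUE's actual code: demo_job,
demo_array, demo_purge_job and demo_delete_array are hypothetical names, and
the sketch assumes that a phantom job shows up as an empty slot in the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SLOTS 4

/* Hypothetical stand-ins for the server's job and array structures. */
typedef struct {
    int id;
} demo_job;

typedef struct {
    demo_job *slots[DEMO_SLOTS]; /* NULL models a job that no longer exists */
    int live_jobs;               /* reference count: jobs still attached */
    bool freed;                  /* set once the array itself is released */
} demo_array;

/* Purging a job drops one reference; the last purge releases the array. */
void demo_purge_job(demo_array *a, int i)
{
    assert(a != NULL && i >= 0 && i < DEMO_SLOTS);
    if (a->slots[i] == NULL)
        return;
    a->slots[i] = NULL;
    if (--a->live_jobs == 0)
        a->freed = true;
}

/* Delete request: purge every job that is present; returns jobs purged. */
int demo_delete_array(demo_array *a)
{
    int purged = 0;
    assert(a != NULL);
    for (int i = 0; i < DEMO_SLOTS; i++) {
        if (a->slots[i] != NULL) {
            demo_purge_job(a, i);
            purged++;
        }
    }
    return purged;
}
```

In this model an array whose slots are all empty never has a job purged, so
nothing ever releases it, and the delete request still completes without
reporting an error.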

This patch:

Index: src/server/array_func.c
===================================================================
--- src/server/array_func.c    (revision 6023)
+++ src/server/array_func.c    (working copy)
@@ -1272,6 +1272,7 @@
   {
   int i;
   int num_skipped = 0;
+  int num_jobs = 0;

   job *pjob;

@@ -1287,6 +1288,7 @@
       }
     else
       {
+      num_jobs++;
       if (pjob->ji_qs.ji_state >= JOB_STATE_EXITING)
         {
         /* invalid state for request,  skip */
@@ -1303,7 +1305,12 @@
         }
       }
     }
-
+  
+  /* If there were no valid jobs, return -1. */
+  if(num_jobs==0){
+    return -1;
+  }
+  
   return(num_skipped);
   } /* END delete_whole_array() */
Index: src/server/req_deletearray.c
===================================================================
--- src/server/req_deletearray.c    (revision 6023)
+++ src/server/req_deletearray.c    (working copy)
@@ -305,10 +305,14 @@
       log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, log_buf);
       }

+    if (num_skipped == -1) {
+      /* the array had no jobs within it, delete it. */
+      log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, "Found array with no jobs - deleting structures.");
+      array_delete(pa);
+    } else if (num_skipped != 0)
+      {
     /* some jobs were not deleted.  They must have been running or had
        JOB_SUBSTATE_TRANSIT */
-    if (num_skipped != 0)
-      {
       ptask = set_task(WORK_Timed, time_now + 10, array_delete_wt, preq, FALSE);

       if (ptask)

explicitly deletes the array object if it is found to contain no jobs or no
valid jobs.
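
The control flow the patch introduces can be sketched in isolation as follows.
This is a simplified model under stated assumptions, not the server code: the
sketch_* names and SKETCH_EXITING are hypothetical, the branch not shown in the
diff is assumed to skip empty slots, and setting a flag stands in for
array_delete(pa).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SLOTS 4
#define SKETCH_EXITING 5 /* hypothetical value standing in for JOB_STATE_EXITING */

typedef struct {
    int state;
} sketch_job;

typedef struct {
    sketch_job *jobs[SKETCH_SLOTS]; /* NULL models a missing job */
    bool deleted;                   /* stands in for array_delete(pa) */
} sketch_array;

/* Returns -1 if the array holds no jobs at all, else the number skipped. */
int sketch_delete_whole_array(sketch_array *pa)
{
    int num_skipped = 0;
    int num_jobs = 0;

    assert(pa != NULL);
    for (int i = 0; i < SKETCH_SLOTS; i++) {
        sketch_job *pjob = pa->jobs[i];
        if (pjob == NULL)
            continue;            /* assumed behaviour of the hidden branch */
        num_jobs++;
        if (pjob->state >= SKETCH_EXITING)
            num_skipped++;       /* invalid state for request, skip */
        else
            pa->jobs[i] = NULL;  /* stands in for deleting the job */
    }

    if (num_jobs == 0)
        return -1;               /* sentinel: nothing left to purge */
    return num_skipped;
}

/* Caller side: returns true if the array object was removed immediately. */
bool sketch_handle_delete(sketch_array *pa)
{
    int num_skipped = sketch_delete_whole_array(pa);

    if (num_skipped == -1) {
        pa->deleted = true;      /* explicit removal, no job purge needed */
        return true;
    }
    /* Otherwise the array is left to the existing retry or purge path. */
    return false;
}
```

The -1 return value is a sentinel: it cannot collide with a real skip count,
so the caller can tell an empty array apart from one where every job was
deleted (0) or some were skipped (a positive count).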

The code which resurrects arrays from disk should also be improved to delete
such jobs rather than requeuing them, but I'll leave that to the experts! :)
With this patch, the arrays can at least be removed.

I've also attached an example bad job. It appears the job is stuck in an
exiting state, and never moves on.
