Bug 185 - Can't delete job arrays with dead jobs, such arrays should never be loaded
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_server
Version: 4.0.*
Platform: PC Linux
Importance: P5 major
Assigned To: David Beer
Reported: 2012-04-25 07:42 MDT by rhys.hill
Modified: 2012-05-03 13:53 MDT (History)

Attachments
An example of a job array which cannot be deleted, due to a corrupt/phantom job. (2.36 KB, application/x-gzip)
2012-04-25 07:42 MDT, rhys.hill

Description rhys.hill 2012-04-25 07:42:54 MDT
Created an attachment (id=105)
An example of a job array which cannot be deleted, due to a corrupt/phantom job.

A job array can become corrupt so that it contains jobs that no longer exist.
Such an array is impossible to delete, because job arrays are currently
deleted via reference-counted garbage collection while the jobs in the array
are purged: if no job can be purged, the array is never freed. An attempt to
delete such an array currently reports no error, but the array is not removed.

This patch:

Index: src/server/array_func.c
===================================================================
--- src/server/array_func.c    (revision 6023)
+++ src/server/array_func.c    (working copy)
@@ -1272,6 +1272,7 @@
   {
   int i;
   int num_skipped = 0;
+  int num_jobs = 0;

   job *pjob;

@@ -1287,6 +1288,7 @@
       }
     else
       {
+    num_jobs++;
       if (pjob->ji_qs.ji_state >= JOB_STATE_EXITING)
         {
         /* invalid state for request,  skip */
@@ -1303,7 +1305,12 @@
         }
       }
     }
-
+  
+  /* If there were no valid jobs, return -1. */
+  if(num_jobs==0){
+    return -1;
+  }
+  
   return(num_skipped);
   } /* END delete_whole_array() */
Index: src/server/req_deletearray.c
===================================================================
--- src/server/req_deletearray.c    (revision 6023)
+++ src/server/req_deletearray.c    (working copy)
@@ -305,10 +305,14 @@
       log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, log_buf);
       }

+    if (num_skipped == -1) {
+      /* the array had no jobs within it, delete it. */
+      log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, "Found array with no jobs - deleting structures.");
+      array_delete(pa);
+    } else if (num_skipped != 0)
+      {
     /* some jobs were not deleted.  They must have been running or had
        JOB_SUBSTATE_TRANSIT */
-    if (num_skipped != 0)
-      {
       ptask = set_task(WORK_Timed, time_now + 10, array_delete_wt, preq, FALSE);

       if (ptask)

explicitly deletes the array object if it is found to contain no jobs or no
valid jobs.
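
To make the control flow easier to follow outside the diff, here is a minimal standalone sketch of the same idea. It is not the TORQUE code: every `sketch_` name, the struct layouts and the state values are hypothetical stand-ins, and it assumes that an empty slot is what a missing job looks like to delete_whole_array(). The retry scheduling for skipped jobs is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; not the real TORQUE definitions. */
enum { SKETCH_STATE_QUEUED = 1, SKETCH_STATE_EXITING = 5 };

typedef struct { int state; } sketch_job;

typedef struct {
    sketch_job **jobs;    /* one slot per sub-job; NULL means the job is gone */
    int          size;    /* number of slots */
    int          deleted; /* nonzero once the array structure is torn down */
} sketch_array;

/* Stand-in for array_delete(): tear down the array structure. */
static void sketch_array_delete(sketch_array *pa)
{
    pa->deleted = 1;
}

/* Purge one job. The array is only torn down as a side effect of
   purging its last remaining job, which is the behaviour the report
   describes. */
static void sketch_purge_job(sketch_array *pa, int i)
{
    pa->jobs[i] = NULL;
    for (int k = 0; k < pa->size; k++)
        if (pa->jobs[k] != NULL)
            return;               /* other jobs remain */
    sketch_array_delete(pa);
}

/* Mirrors the patched delete_whole_array(): returns the number of jobs
   skipped, or -1 if no slot held a job at all. */
static int sketch_delete_whole_array(sketch_array *pa)
{
    int num_skipped = 0;
    int num_jobs = 0;

    assert(pa != NULL);
    for (int i = 0; i < pa->size; i++) {
        sketch_job *pjob = pa->jobs[i];
        if (pjob == NULL)
            continue;             /* phantom job: nothing to purge */
        num_jobs++;
        if (pjob->state >= SKETCH_STATE_EXITING) {
            num_skipped++;        /* invalid state for request, skip */
            continue;
        }
        sketch_purge_job(pa, i);
    }
    if (num_jobs == 0)
        return -1;                /* nothing was purged, so nothing freed the array */
    return num_skipped;
}

/* Mirrors the patched request handler: delete the array explicitly
   when the -1 sentinel comes back. */
static int sketch_req_deletearray(sketch_array *pa)
{
    int num_skipped = sketch_delete_whole_array(pa);
    if (num_skipped == -1)
        sketch_array_delete(pa);
    return num_skipped;
}
```

Without the -1 branch, an array whose slots are all empty never reaches sketch_purge_job(), so nothing ever tears it down; that is the silent failure described above.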

The code which resurrects arrays from disk should also be improved to delete
such jobs rather than requeuing them, but I'll leave that to the experts! :)
With this patch, the arrays can at least be removed.

I've also attached an example bad job. It appears the job is stuck in an
exiting state, and never moves on.
Comment 1 David Beer 2012-05-03 13:53:50 MDT
Checked in to 4.0.2. We also now avoid loading arrays that have no valid jobs.