[torquedev] [Bug 185] New: Can't delete job arrays with dead jobs, such arrays should never be loaded
bugzilla-daemon at supercluster.org
Wed Apr 25 07:42:54 MDT 2012
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=185
Summary: Can't delete job arrays with dead jobs, such arrays
should never be loaded
Product: TORQUE
Version: 3.0.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: major
Priority: P5
Component: pbs_server
AssignedTo: dbeer at adaptivecomputing.com
ReportedBy: rhys.hill at adelaide.edu.au
CC: torquedev at supercluster.org
Estimated Hours: 0.0
Created an attachment (id=105)
--> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=105)
An example of a job array which cannot be deleted, due to a corrupt/phantom
job.
A job array can become corrupt so that it contains jobs which no longer
exist. Such an array is impossible to delete, because job arrays are
currently freed by reference-counted garbage collection as the jobs in the
array are purged. An attempt to delete such an array reports no error, but
the array is not removed.
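To make the failure mode concrete, here is a minimal toy model in C. It is
not TORQUE's code: the struct and the toy_* functions are hypothetical, and it
only assumes what is described above, namely that the array is freed when the
last job reference is dropped during purging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model, not TORQUE's real structures. */
struct toy_array {
    int  refs;   /* one reference per job the array believes it has */
    bool freed;  /* set once the last reference is dropped */
};

/* Called when a real job is purged: drop its reference and free the
 * array when none remain. */
static void toy_purge_job(struct toy_array *a)
{
    assert(a != NULL && a->refs > 0);
    if (--a->refs == 0)
        a->freed = true;
}

/* Delete request: purge every job that actually exists. Phantom jobs
 * have no job object, so nothing ever drops their references. */
static bool toy_delete_array(struct toy_array *a, int live_jobs)
{
    assert(a != NULL && live_jobs <= a->refs);
    for (int i = 0; i < live_jobs; i++)
        toy_purge_job(a);
    return a->freed;
}
```

In this model an array with 3 references and 3 live jobs is freed, while an
array with 3 references and only 2 live jobs keeps one reference forever and
is never freed, even though the delete request itself raises no error.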
This patch:
Index: src/server/array_func.c
===================================================================
--- src/server/array_func.c (revision 6023)
+++ src/server/array_func.c (working copy)
@@ -1272,6 +1272,7 @@
{
int i;
int num_skipped = 0;
+ int num_jobs = 0;
job *pjob;
@@ -1287,6 +1288,7 @@
}
else
{
+ num_jobs++;
if (pjob->ji_qs.ji_state >= JOB_STATE_EXITING)
{
/* invalid state for request, skip */
@@ -1303,7 +1305,12 @@
}
}
}
-
+
+ /* If there were no valid jobs, return -1. */
+ if(num_jobs==0){
+ return -1;
+ }
+
return(num_skipped);
} /* END delete_whole_array() */
Index: src/server/req_deletearray.c
===================================================================
--- src/server/req_deletearray.c (revision 6023)
+++ src/server/req_deletearray.c (working copy)
@@ -305,10 +305,14 @@
log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, log_buf);
}
+ if (num_skipped == -1) {
+ /* the array had no jobs within it, delete it. */
+ log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, "Found array with no jobs - deleting structures.");
+ array_delete(pa);
+ } else if (num_skipped != 0)
+ {
/* some jobs were not deleted. They must have been running or had
JOB_SUBSTATE_TRANSIT */
- if (num_skipped != 0)
- {
ptask = set_task(WORK_Timed, time_now + 10, array_delete_wt, preq,
FALSE);
if (ptask)
explicitly deletes the array object if it is found to contain no jobs or no
valid jobs.
The code which resurrects arrays from disk should also be improved to delete
such jobs rather than requeuing them, but I'll leave that to the experts! :)
With this patch, the arrays can at least be removed.
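The return convention the patch introduces can be sketched in isolation as
follows. This is a simplified stand-in, not the real functions: the toy_*
names, the state enum and the NULL check for an empty slot are assumptions
(the diff does not show the condition of the branch before the else).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the job states and job struct. */
enum toy_state { TOY_QUEUED, TOY_RUNNING, TOY_EXITING };
struct toy_job { enum toy_state state; };

/* Mirrors the patched convention: -1 when no job was counted,
 * otherwise the number of jobs that were skipped. */
static int toy_delete_whole_array(struct toy_job *jobs[], size_t n)
{
    int num_skipped = 0;
    int num_jobs = 0;

    assert(jobs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (jobs[i] == NULL)       /* assumption: empty or phantom slot */
            continue;
        num_jobs++;
        if (jobs[i]->state >= TOY_EXITING)
            num_skipped++;         /* invalid state for request, skip */
        /* else: the real code deletes the job here */
    }
    return (num_jobs == 0) ? -1 : num_skipped;
}

/* What the caller does with the result, as in req_deletearray.c. */
enum toy_action { TOY_FREE_ARRAY, TOY_RETRY_LATER, TOY_DONE };

static enum toy_action toy_handle_result(int num_skipped)
{
    if (num_skipped == -1)
        return TOY_FREE_ARRAY;     /* array_delete(pa) in the patch */
    if (num_skipped != 0)
        return TOY_RETRY_LATER;    /* the timed array_delete_wt task */
    return TOY_DONE;
}
```

Note that in this sketch, as in the diff, num_jobs is incremented before the
state check, so a slot that still holds a job in an exiting state counts as a
job and leads to the retry path rather than to -1.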
I've also attached an example bad job. It appears the job is stuck in an
exiting state, and never moves on.