Bugzilla – Bug 185
Can't delete job arrays with dead jobs, such arrays should never be loaded
Last modified: 2012-05-03 13:53:50 MDT
Created an attachment (id=105)
An example of a job array which cannot be deleted, due to a corrupt/phantom job.

Job arrays can become corrupt such that they contain jobs that no longer exist. Such an array is impossible to delete, because job arrays are currently deleted via reference-counted garbage collection during purging of the jobs in the array, so a job that no longer exists is never purged and the array is never released. Attempting to delete such an array currently reports no error, but the array is not removed.

This patch explicitly deletes the array object if it is found to contain no jobs or no valid jobs:

Index: src/server/array_func.c
===================================================================
--- src/server/array_func.c (revision 6023)
+++ src/server/array_func.c (working copy)
@@ -1272,6 +1272,7 @@
   {
   int i;
   int num_skipped = 0;
+  int num_jobs = 0;
   job *pjob;
@@ -1287,6 +1288,7 @@
       }
     else
       {
+      num_jobs++;
       if (pjob->ji_qs.ji_state >= JOB_STATE_EXITING)
         {
         /* invalid state for request, skip */
@@ -1303,7 +1305,12 @@
         }
       }
     }
-
+
+  /* If there were no valid jobs, return -1. */
+  if(num_jobs==0){
+    return -1;
+  }
+
   return(num_skipped);
   }
 /* END delete_whole_array() */
Index: src/server/req_deletearray.c
===================================================================
--- src/server/req_deletearray.c (revision 6023)
+++ src/server/req_deletearray.c (working copy)
@@ -305,10 +305,14 @@
     log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, log_buf);
     }
+  if (num_skipped == -1) {
+    /* the array had no jobs within it, delete it. */
+    log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, __func__, "Found array with no jobs - deleting structures.");
+    array_delete(pa);
+  } else if (num_skipped != 0)
+    {
   /* some jobs were not deleted. They must have been running or had JOB_SUBSTATE_TRANSIT */
-  if (num_skipped != 0)
-    {
     ptask = set_task(WORK_Timed, time_now + 10, array_delete_wt, preq, FALSE);
     if (ptask)
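For anyone reading the diff without the surrounding source, the control flow it introduces can be sketched in isolation. This is only a stand-alone illustration, not the TORQUE code: the `sketch_*` types, the state constants, and the assumption that a missing job shows up as a NULL slot are all hypothetical. It shows the idea of counting the jobs actually found and returning -1 as a sentinel when there are none, so the caller can tell "nothing skipped" apart from "nothing there".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the server's job states and structures. */
enum { SKETCH_STATE_QUEUED = 1, SKETCH_STATE_EXITING = 5 };

typedef struct {
    int state;    /* one of the SKETCH_STATE_* values */
    int deleted;  /* set to 1 once a delete has been issued */
} sketch_job;

typedef struct {
    sketch_job **jobs;  /* assumption: a NULL slot means the job no longer exists */
    size_t size;
} sketch_array;

/*
 * Issue a delete for every job in the array that is still in a deletable state.
 * Returns the number of jobs skipped because they were already exiting,
 * or -1 if the array contained no jobs at all.
 */
int sketch_delete_whole_array(sketch_array *pa)
{
    size_t i;
    int num_skipped = 0;
    int num_jobs = 0;

    assert(pa != NULL);

    for (i = 0; i < pa->size; i++) {
        sketch_job *pjob = pa->jobs[i];

        if (pjob == NULL)
            continue;               /* phantom slot: nothing to delete */

        num_jobs++;

        if (pjob->state >= SKETCH_STATE_EXITING) {
            num_skipped++;          /* invalid state for the request, skip */
            continue;
        }

        pjob->deleted = 1;
    }

    /* No jobs found: signal the caller to remove the array itself. */
    if (num_jobs == 0)
        return -1;

    return num_skipped;
}
```

A caller would then treat -1 as "delete the array structure now", a positive count as "retry later for the skipped jobs", and 0 as "all jobs handled, let the normal purge path release the array".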
The code which resurrects arrays from disk should also be improved to delete such jobs rather than requeuing them, but I'll leave that to the experts! :) With this patch, the arrays can at least be removed. I've also attached an example bad job. It appears the job is stuck in an exiting state, and never moves on.
Checked in to 4.0.2. We also now avoid loading arrays that have no valid jobs.
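As a rough illustration of the load-time check mentioned above, the sketch below decides whether a recovered array is worth keeping. It is not the code that went into 4.0.2: the function name and the representation of an array as a list of "job present" flags are assumptions made purely for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: job_present[i] is non-zero if the i-th job of a
 * recovered array was found on disk. Returns 1 if the array has at least
 * one job and should be loaded, 0 if it should be discarded instead.
 */
int sketch_array_is_loadable(const int *job_present, size_t n)
{
    size_t i;

    assert(n == 0 || job_present != NULL);

    for (i = 0; i < n; i++) {
        if (job_present[i])
            return 1;   /* at least one valid job: keep the array */
    }

    return 0;           /* no valid jobs: do not load this array */
}
```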