[torquedev] cpu use reporting

garrick at speculation.org
Wed Jun 14 17:44:43 MDT 2006


On Mon, Jun 12, 2006 at 10:53:08AM -0400, Brock Palen alleged:
> We are seeing strange problems with torque-2.1.0p0  backed by  
> maui-3.2.6p14.
> Basically a single cpu is being allocated more than once.

As I saw it, this bug has two requirements: job polling must be enabled
(it is by default in 2.1.0), and keep_completed must not be enabled.

The job poll task for a given job wasn't being removed after the job
exited.  This caused an extra job_purge() on a job that no longer
existed, which in turn caused mild memory corruption.
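
For anyone following along, the failure mode can be illustrated with a
minimal standalone sketch.  This is not TORQUE code: every toy_* name and
structure below is made up for illustration and only loosely mirrors the
idea of a timed task that holds a pointer to a job.  The assumption being
shown is that a task survives its job unless the job also keeps a list of
its own tasks that teardown can walk.

```c
#include <stdlib.h>

struct toy_job;

/* Hypothetical timed task; not the real work_task structure. */
struct toy_task {
    struct toy_task *next;      /* next entry in the global pending list */
    struct toy_task *job_next;  /* next task linked to the same job */
    struct toy_job  *job;       /* job this task would act on when fired */
};

/* Hypothetical job; not the real job structure. */
struct toy_job {
    struct toy_task *tasks;     /* tasks registered against this job */
};

static struct toy_task *pending = NULL;

/* Queue a task for a job; optionally record it on the job as well. */
static struct toy_task *toy_set_task(struct toy_job *job, int link_to_job)
{
    struct toy_task *t = calloc(1, sizeof(*t));
    if (t == NULL)
        return NULL;
    t->job = job;
    t->next = pending;
    pending = t;
    if (link_to_job) {
        t->job_next = job->tasks;
        job->tasks = t;
    }
    return t;
}

/* Unlink one task from the global pending list and free it. */
static void toy_delete_task(struct toy_task *t)
{
    struct toy_task **pp = &pending;
    while (*pp != NULL && *pp != t)
        pp = &(*pp)->next;
    if (*pp == t)
        *pp = t->next;
    free(t);
}

/* Free a job, first cancelling every task that was linked to it. */
static void toy_job_purge(struct toy_job *job)
{
    while (job->tasks != NULL) {
        struct toy_task *t = job->tasks;
        job->tasks = t->job_next;
        toy_delete_task(t);
    }
    free(job);
}

/* Create a job with one task, purge the job, and report how many
 * tasks are still pending (each one points at freed memory). */
static int toy_stale_tasks_after_purge(int link_to_job)
{
    struct toy_job *job = calloc(1, sizeof(*job));
    int stale = 0;

    if (job == NULL)
        return -1;
    if (toy_set_task(job, link_to_job) == NULL) {
        free(job);
        return -1;
    }
    toy_job_purge(job);

    for (struct toy_task *t = pending; t != NULL; t = t->next)
        stale++;

    /* Drop leftovers so repeated calls stay independent. */
    while (pending != NULL)
        toy_delete_task(pending);
    return stale;
}
```

In this toy model, toy_stale_tasks_after_purge(0) returns 1 (the task
outlives its job and would later act on freed memory), while
toy_stale_tasks_after_purge(1) returns 0 because the purge finds the
task through the job's own list and cancels it first.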

The diff below seems to fix it and was just applied to CVS.

diff -u -r1.50 -r1.51
--- src/server/pbsd_main.c      12 Jun 2006 16:24:09 -0000      1.50
+++ src/server/pbsd_main.c      14 Jun 2006 23:40:03 -0000      1.51
@@ -1056,6 +1056,8 @@
     if (server.sv_attr[(int)SRV_ATR_PollJobs].at_val.at_long && 
        (last_jobstat_time + JobStatRate <= time_now))
       {
+      struct work_task *ptask;
+
       for (pjob = (job *)GET_NEXT(svr_alljobs);
            pjob != NULL;
            pjob = (job *)GET_NEXT(pjob->ji_alljobs)) 
@@ -1066,7 +1068,12 @@
 
           when = pjob->ji_wattr[(int)JOB_ATR_qrank].at_val.at_long % JobStatRate;
     
-          set_task(WORK_Timed,when + time_now,poll_job_task,pjob);
+          ptask = set_task(WORK_Timed,when + time_now,poll_job_task,pjob);
+
+          if (ptask)
+            {
+            append_link(&pjob->ji_svrtask,&ptask->wt_linkobj,ptask);
+            }
           }
         }
 


