[torqueusers] reported cpu time during running parallel jobs in torque 2.1.3...

Garrick Staples garrick at clusterresources.com
Wed Oct 18 19:35:32 MDT 2006


On Wed, Oct 18, 2006 at 01:39:17PM -0600, Garrick Staples alleged:
> On Wed, Oct 18, 2006 at 12:26:18PM -0600, Garrick Staples alleged:
> > On Wed, Oct 18, 2006 at 05:40:40PM +0100, David Golden alleged:
> > > Well, perhaps in some sort of karmic revenge after the on-list discussion of 
> > > cput time accounting a while back, I just tried upgrading to torque 2.1.3, and it 
> > > seems something strange is going on with _recent_ torque:
> > > 
> > > The resources_used.cput number ultimately reported  in 
> > > e.g. /var/spool/pbs/server_priv/accounting/ for 
> > > parallel jobs still seems accurate enough
> > > 
> > > However, qstat -f is underreporting, even when the job is in the "C" state, 
> > > as if it's only reporting the cput of the processes on the job's mother 
> > > superior node - and I think the issue might also be mangling our maui stats...
> > 
> > That's peculiar.
> > 
> > Looking...
> 
> It seems that sister MOMs aren't sending regular updates of cput; an
> update only happens at the very end of the job.
> 
> Plus there is some sort of a race condition preventing the final
> resources update (that gets into the accounting record) from getting to
> the stat output.
> 
> Still looking...

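I think the patch below fixes both problems: sister MOMs now update
their usage on every poll, and the server saves the final resources_used
on the job when the obit arrives.  Initial tests are good, but I want to
bang at it some more.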


Index: src/resmom/mom_main.c
===================================================================
--- src/resmom/mom_main.c       (revision 1053)
+++ src/resmom/mom_main.c       (working copy)
@@ -6799,14 +6799,14 @@
       if (pjob->ji_qs.ji_substate != JOB_SUBSTATE_RUNNING)
         continue;
 
-      if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) == 0)
-        continue;
-
       /* update information for my tasks */
 
       mom_set_use(pjob);
       rpp_io();
 
+      if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) == 0)
+        continue;
+
       /* has all job processes vanished undetected ?       */
       /* double check by sig0 to session pid for each task */
 
Index: src/server/req_jobobit.c
===================================================================
--- src/server/req_jobobit.c    (revision 1053)
+++ src/server/req_jobobit.c    (working copy)
@@ -1626,6 +1626,13 @@
   pjob->ji_wattr[(int)JOB_ATR_exitstat].at_flags |=ATR_VFLAG_SET;
 
   patlist = (svrattrl *)GET_NEXT(preq->rq_ind.rq_jobobit.rq_attr);
+ 
+  /* Encode the final resources_used into the job (useful for keep_completed) */
+  modify_job_attr(
+    pjob,
+    patlist,
+    ATR_DFLAG_MGWR | ATR_DFLAG_SvWR,
+    &bad);
 
   sprintf(acctbuf,msg_job_end_stat, 
     pjob->ji_qs.ji_un.ji_exect.ji_exitstat);
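
For anyone reading the mom_main.c hunk without the source handy, here is
a rough standalone sketch of what the reordering does.  It is not TORQUE
code: every sketch_* / SKETCH_* name is a made-up stand-in (sketch_set_use
plays the role of mom_set_use, SKETCH_FLAG_HERE the role of
JOB_SVFLG_HERE), and the real loop does much more.  The only point is
that the usage update now runs for every running job before the "am I
the mother superior?" test, so sisters record cput too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real TORQUE definitions. */
#define SKETCH_FLAG_HERE 0x1  /* set when this MOM is the job's mother superior */

enum sketch_substate { SKETCH_QUEUED, SKETCH_RUNNING };

struct sketch_job {
  enum sketch_substate substate;
  unsigned             flags;
  long                 local_cput;    /* CPU seconds used by this node's tasks */
  long                 reported_cput; /* value recorded by the last usage update */
};

/* Stand-in for mom_set_use(): record usage for the tasks on this node. */
static void sketch_set_use(struct sketch_job *j)
{
  j->reported_cput = j->local_cput;
}

/* One polling pass over the job list.  Returns how many jobs reached the
 * mother-superior-only checks that follow the flag test. */
static int sketch_poll(struct sketch_job *jobs, size_t njobs)
{
  int    ms_checked = 0;
  size_t i;

  assert(jobs != NULL || njobs == 0);

  for (i = 0; i < njobs; i++) {
    struct sketch_job *j = &jobs[i];

    if (j->substate != SKETCH_RUNNING)
      continue;

    /* Patched order: update usage first, on sisters as well. */
    sketch_set_use(j);

    /* Only the mother superior continues to the remaining checks. */
    if ((j->flags & SKETCH_FLAG_HERE) == 0)
      continue;

    ms_checked++;
  }

  return ms_checked;
}
```

With the old order (flag test first), a job on a sister node would leave
the loop before sketch_set_use() and its reported_cput would stay stale
until the job ended, which matches what qstat -f was showing.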


