[torqueusers] reported cpu time during running parallel jobs in
torque 2.1.3...
Garrick Staples
garrick at clusterresources.com
Wed Oct 18 19:35:32 MDT 2006
On Wed, Oct 18, 2006 at 01:39:17PM -0600, Garrick Staples alleged:
> On Wed, Oct 18, 2006 at 12:26:18PM -0600, Garrick Staples alleged:
> > On Wed, Oct 18, 2006 at 05:40:40PM +0100, David Golden alleged:
> > > Well, perhaps in some sort of karmic revenge after on-list discussion of
> > > cput time accounting while back, just tried upgrading to torque 2.1.3, and it
> > > seems something strange is going on with _recent_ torque:
> > >
> > > The resources_used.cput number ultimately reported in
> > > e.g. /var/spool/pbs/server_priv/accounting/ for
> > > parallel jobs still seems accurate enough
> > >
> > > However, qstat -f is underreporting, even when job is in "C" state, maybe
> > > as if it's only reporting the job's mother superior node's processes
> > > cput - and I think the issue might also be mangling our maui stats...
> >
> > That's peculiar.
> >
> > Looking...
>
> It seems that sister MOMs aren't sending regular updates of cput, it
> only happens at the very end.
>
> Plus there is some sort of a race condition preventing the final
> resources update (that gets into the accounting record) from getting to
> the stat output.
>
> Still looking...

I think this fixes both problems. Initial tests are good, but I want to
bang at it some more.

Index: src/resmom/mom_main.c
===================================================================
--- src/resmom/mom_main.c (revision 1053)
+++ src/resmom/mom_main.c (working copy)
@@ -6799,14 +6799,14 @@
     if (pjob->ji_qs.ji_substate != JOB_SUBSTATE_RUNNING)
       continue;
-    if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) == 0)
-      continue;
-
     /* update information for my tasks */
     mom_set_use(pjob);
     rpp_io();
+    if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) == 0)
+      continue;
+
     /* has all job processes vanished undetected ? */
     /* double check by sig0 to session pid for each task */
Index: src/server/req_jobobit.c
===================================================================
--- src/server/req_jobobit.c (revision 1053)
+++ src/server/req_jobobit.c (working copy)
@@ -1626,6 +1626,13 @@
   pjob->ji_wattr[(int)JOB_ATR_exitstat].at_flags |=ATR_VFLAG_SET;
   patlist = (svrattrl *)GET_NEXT(preq->rq_ind.rq_jobobit.rq_attr);
+
+  /* Encode the final resources_used into the job (useful for keep_completed) */
+  modify_job_attr(
+    pjob,
+    patlist,
+    ATR_DFLAG_MGWR | ATR_DFLAG_SvWR,
+    &bad);
   sprintf(acctbuf,msg_job_end_stat,
     pjob->ji_qs.ji_un.ji_exect.ji_exitstat);
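
The effect of the mom_main.c reordering can be illustrated with a small
standalone model. This is only a sketch and not TORQUE code: struct job,
sample_usage(), poll_jobs_old(), poll_jobs_new() and total_cput() are
hypothetical names invented for illustration. The assumption is that
sample_usage() plays the role of mom_set_use() and that the
mother_superior field plays the role of the JOB_SVFLG_HERE flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one MOM's view of a job. */
struct job {
    bool running;          /* models ji_substate == JOB_SUBSTATE_RUNNING */
    bool mother_superior;  /* models the JOB_SVFLG_HERE flag */
    long sampled_cput;     /* cput as of the last usage sample */
    long actual_cput;      /* cput the local tasks have really consumed */
};

/* Stand-in for mom_set_use(): refresh the locally sampled usage. */
static void sample_usage(struct job *j)
{
    j->sampled_cput = j->actual_cput;
}

/* Old ordering: sister MOMs are skipped before usage is sampled. */
void poll_jobs_old(struct job *jobs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].running)
            continue;
        if (!jobs[i].mother_superior)
            continue;
        sample_usage(&jobs[i]);
    }
}

/* New ordering: every MOM samples usage, then sisters skip the rest. */
void poll_jobs_new(struct job *jobs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].running)
            continue;
        sample_usage(&jobs[i]);
        if (!jobs[i].mother_superior)
            continue;
        /* mother-superior-only checks would follow here */
    }
}

/* Sum of the sampled cput across all entries. */
long total_cput(const struct job *jobs, size_t n)
{
    long total = 0;

    assert(jobs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += jobs[i].sampled_cput;
    return total;
}
```

In this model, a job with one mother superior entry and one sister entry
only accumulates the mother superior's cput under poll_jobs_old(), which
matches the underreporting described above. Under poll_jobs_new() both
entries contribute.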