[torqueusers] qalter -lwalltime not propagated to slave moms

Garrick Staples garrick at usc.edu
Mon Nov 28 15:08:35 MST 2005


Definitely a bug that changes aren't propagated to sisters, but your
change for walltime seems reasonable anyways.  I'll commit it.
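
For anyone skimming the archive, the idea in the quoted patch can be sketched
as a small standalone function. This is only an illustration under stated
assumptions, not the code in mom_mach.c: the flag SKETCH_FLAG_MS and the
function walltime_exceeded() are hypothetical names standing in for
JOB_SVFLG_HERE and the walltime branch of mom_over_limit().

```c
#include <time.h>

/* Hypothetical stand-in for JOB_SVFLG_HERE; the value is for illustration only. */
#define SKETCH_FLAG_MS 0x01

/*
 * Returns 1 if this mom should report the job as over its walltime limit,
 * 0 otherwise. Only the mother superior (flag set) ever enforces the limit,
 * so a stale limit on a sister can never kill the job.
 */
int walltime_exceeded(int svrflags, time_t now, time_t start,
                      double wallfactor, unsigned long limit)
{
    unsigned long used;

    if ((svrflags & SKETCH_FLAG_MS) == 0)
        return 0;               /* sister: skip the walltime check */

    /* elapsed wall-clock seconds, scaled the same way as in the patch */
    used = (unsigned long)((double)(now - start) * wallfactor);

    return used > limit;
}
```

With a job started at time 0, checked at time 120 against a 100 second limit,
the mother superior reports the job as over limit while a sister with the same
numbers does not.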

On Tue, Nov 29, 2005 at 07:39:02AM +1100, David Singleton alleged:
> 
> I don't think any qalter variations are propagated to sisters.
> 
> Our solution for walltime variations is to make sure only MS
> applies the walltime limit.
> 
> 
> src/resmom/*/mom_mach.c:
> 
> int mom_over_limit(job *pjob)
> {
>  .....
> 
>                 } else if (strcmp(pname, "walltime") == 0) {
>                         /* ANUPBS:
>                                - only have MS check walltime
>                                - covers bug: resource modifications are not
>                                  propagated to the sisterhood
>                                - assumes only walltime being modified (most
>                                  common)
>                          */
>                         if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) == 0)
>                                 continue;
>                         retval = local_gettime(pres, &value);
>                         if (retval != PBSE_NONE)
>                                 continue;
>                         num = (unsigned long)((double)(time_now - pjob->ji_qs.ji_stime)
>                                               * wallfactor);
>                         if (num > value) {
>                                 sprintf(log_buffer,
>                                         "walltime %lusec exceeded limit %lusec",
>                                         num, value);
>                                 ret = (JOB_SVFLG_OVERLMT1 | JOB_SVFLG_OVERLMTWALL);
>                         }
>                 }
> 
> David
> 
> 
> Martin Schaff?ner wrote:
> >On Monday 28 November 2005 10:37, Thomas Zeiser wrote:
> >
> >>Dear All,
> >>
> >>at least on our cluster, it seems that changes with qalter to the
> >>walltime after the job is started are not correctly propagated to
> >>sister moms. As a consequence, parallel jobs started with Pete's
> >>mpiexec get killed once the original walltime is exceeded.
> >
> >
> >I don't know the reason for this, but I can at least confirm this 
> >"feature" in TORQUE 2.0.0p1.
> >
> >Regards,
> 
> 
> -- 
> --------------------------------------------------------------------------
>    Dr David Singleton               ANU Supercomputer Facility
>    HPC Systems Manager              and APAC National Facility
>    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
>    Phone: +61 2 6125 4389           Australian National University
>    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
> --------------------------------------------------------------------------

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California