[torquedev] New 2.5.6 snapshot

Martin Siegert siegert at sfu.ca
Wed Apr 20 19:21:10 MDT 2011


On Tue, Apr 19, 2011 at 10:33:21AM -0600, David Beer wrote:
> 
> 
> ----- Original Message -----
> > Hi,
> > 
> > On Thu, Apr 07, 2011 at 05:05:04PM -0600, Ken Nielson wrote:
> > > There is a new snapshot for 2.5.6 available. This fixes a problem with
> > > a patch for Bugzilla 116 where the new resource procct was added. If the
> > > -l nodes option was not used in a job submission then the job would not
> > > be run by Moab because procct was added to the Resource_List attribute
> > > and treated like a generic resource by Moab. Because the generic resource
> > > procct does not exist Moab never schedules the job.
> > >
> > > This is now fixed.
> > >
> > > You can download this snapshot at
> > > http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.6-snap.201104071657.tar.gz
> > >
> > > Please download and let us know if you find any problems.
> > 
> > I am afraid this does not work: I haven't traced this back to the
> > source routine, but apparently this new version presets the nodes
> > resource to 1, correct?
> > Thus, if a user only requests -l procs=N, with 2.5.6-snap.201104071657
> > procct is set to N+1, not N, see
> > 
> > resc_def_all.c, line 1118:
> > 
> > ppct->rs_value.at_val.at_long =
> >   count_proc(pnodesp->rs_value.at_val.at_str)
> >   + pprocsp->rs_value.at_val.at_long;
> > 
> > torque-2.5.6-snap.201104041023 actually worked flawlessly for me.
> > Which means that I haven't figured out how to trigger the bug that
> > torque-2.5.6-snap.201104071657 was supposed to fix.
> > Regardless of whether I specified -l nodes=... or -l procs=... or
> > neither moab always started my job, i.e., the procct resource
> > always got removed before the job was sent to moab, see,
> > 
> > svr_jobfunc.c, line 1965:
> > 
> > if (strcmp(pque->qu_attr->at_val.at_str, "Execution") == 0)
> > {
> >   /* job routed to Execution queue successfully */
> >   /* unset job's procct resource */
> >   resource_def *pctdef;
> >   resource *pctresc;
> >   pctdef = find_resc_def(svr_resc_def, "procct", svr_resc_size);
> >   if ((pctresc = find_resc_entry(&pjob->ji_wattr[JOB_ATR_resource],
> >                                  pctdef)) != NULL)
> >     pctdef->rs_free(&pctresc->rs_value);
> > }
> > }
> > 
> > If somebody can explain to me how to submit a job that is not caught in
> > this if block, I may be able to fix this.
> > 
> 
> This issue is now resolved. The problem was where no resource was requested and then the nodes request was applied by default. This was resolved by adding code to free the resource after queue and server defaults are applied. The new snapshot can be found here:
> 
> http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.6-snap.201104191030.tar.gz

Sorry, this does not work either.
The new code in set_resc_deflt (in svr_jobfunc.c)

  /* unset the procct resource if it has been set */
  pctdef = find_resc_def(svr_resc_def, "procct", svr_resc_size);

  if ((pctresc = find_resc_entry(ja, pctdef)) != NULL)
    pctdef->rs_free(&pctresc->rs_value);

unsets procct before the routing is done, which basically disables
all of the procct routing code and makes procct completely useless.
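
To make the ordering problem concrete, here is a minimal sketch in plain C.
It does not use TORQUE's real data structures: struct toy_job, toy_route and
the other toy_* names are hypothetical stand-ins, and the "small/large queue"
rule is only an assumed example of routing on procct. It shows that unsetting
procct before routing leaves the router nothing to decide on, while routing
first and unsetting afterwards still hands the scheduler a job without procct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a job's resource list; not TORQUE's types. */
struct toy_job {
    bool procct_set;   /* is procct currently present on the job? */
    long procct;       /* processor count, meant for routing only */
};

/* Assumed routing rule: queue 0 for small jobs, queue 1 for large ones.
 * Returns -1 when procct is not available to route on. */
int toy_route(const struct toy_job *job, long small_max)
{
    assert(job != NULL);
    if (!job->procct_set)
        return -1;
    return job->procct <= small_max ? 0 : 1;
}

/* Remove procct so that the scheduler never sees it. */
void toy_unset_procct(struct toy_job *job)
{
    assert(job != NULL);
    job->procct_set = false;
    job->procct = 0;
}

/* Order A: unset first, then route. The router has nothing to work with. */
int toy_unset_then_route(struct toy_job job, long small_max)
{
    toy_unset_procct(&job);
    return toy_route(&job, small_max);
}

/* Order B: route first, then unset before the job goes to the scheduler.
 * *procct_left reports whether procct is still present afterwards. */
int toy_route_then_unset(struct toy_job job, long small_max, bool *procct_left)
{
    int queue;
    assert(procct_left != NULL);
    queue = toy_route(&job, small_max);
    toy_unset_procct(&job);
    *procct_left = job.procct_set;
    return queue;
}
```

With a job carrying procct = 16 and small_max = 8, order A cannot route the
job at all, whereas order B picks the large queue and still leaves no procct
behind.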

But I have been able to reproduce the bug that occurs when setting

set server resources_default.nodes = 1

and submitting a job that requests no resources.
I'll try to fix this.
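
For reference, the procct arithmetic quoted above from resc_def_all.c can be
sketched as follows. This is only an illustration under stated assumptions:
toy_count_proc is a hypothetical, much simplified stand-in for count_proc
(it understands only "N", "N:ppn=K" and '+'-separated chunks, and treats a
chunk without a leading number as one node), and toy_procct is a made-up
wrapper name. It shows why a nodes value of 1 added on top of -l procs=N
yields N+1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for count_proc(): sums nodes * ppn
 * over '+'-separated chunks such as "2:ppn=4+1". Not TORQUE's parser. */
long toy_count_proc(const char *spec)
{
    long total = 0;

    assert(spec != NULL);
    while (*spec != '\0') {
        char *end;
        long nodes = strtol(spec, &end, 10);
        long ppn = 1;
        const char *plus = strchr(spec, '+');
        const char *p = strstr(spec, "ppn=");

        if (end == spec)
            nodes = 1;                 /* assumption: no number means one node */
        if (p != NULL && (plus == NULL || p < plus))
            ppn = strtol(p + 4, NULL, 10);   /* ppn belongs to this chunk */

        total += nodes * ppn;
        if (plus == NULL)
            break;
        spec = plus + 1;               /* continue with the next chunk */
    }
    return total;
}

/* Mirrors the quoted expression: count_proc(nodes) + procs. */
long toy_procct(const char *nodes_spec, long procs)
{
    return toy_count_proc(nodes_spec) + procs;
}
```

Under these assumptions, toy_procct("1", 8) gives 9 while toy_procct("", 8)
gives 8, which is the N+1 versus N difference described earlier in the thread.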

Cheers,
Martin

