[torquedev] New 2.5.6 snapshot

Martin Siegert siegert at sfu.ca
Tue Apr 26 14:25:06 MDT 2011


On Wed, Apr 20, 2011 at 06:21:10PM -0700, Martin Siegert wrote:
> On Tue, Apr 19, 2011 at 10:33:21AM -0600, David Beer wrote:
> > 
> > 
> > ----- Original Message -----
> > > Hi,
> > > 
> > > On Thu, Apr 07, 2011 at 05:05:04PM -0600, Ken Nielson wrote:
> > > > There is a new snapshot for 2.5.6 available. This fixes a problem with
> > > > a patch for Bugzilla 116 where the new resource procct was added. If the
> > > > -l nodes option was not used in a job submission then the job would not
> > > > be run by Moab because procct was added to the Resource_List attribute
> > > > and treated like a generic resource by Moab. Because the generic resource
> > > > procct does not exist Moab never schedules the job.
> > > >
> > > > This is now fixed.
> > > >
> > > > You can download this snapshot at
> > > > http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.6-snap.201104071657.tar.gz
> > > >
> > > > Please download and let us know if you find any problems.
> > > 
> > > I am afraid this does not work: I haven't traced this back to the
> > > source routine, but apparently this new version presets the nodes
> > > resource to 1, correct?
> > > Thus, if a user only requests -l procs=N, with 2.5.6-snap.201104071657
> > > procct is set to N+1, not N, see
> > > 
> > > resc_def_all.c, line 1118:
> > > 
> > > ppct->rs_value.at_val.at_long =
> > >     count_proc(pnodesp->rs_value.at_val.at_str)
> > >     + pprocsp->rs_value.at_val.at_long;
> > > 
> > > torque-2.5.6-snap.201104041023 actually worked flawlessly for me.
> > > Which means that I haven't figured out how to trigger the bug that
> > > torque-2.5.6-snap.201104071657 was supposed to fix.
> > > Regardless of whether I specified -l nodes=... or -l procs=... or
> > > neither moab always started my job, i.e., the procct resource
> > > always got removed before the job was sent to moab, see,
> > > 
> > > svr_jobfunc.c, line 1965:
> > > 
> > > if (strcmp(pque->qu_attr->at_val.at_str, "Execution") == 0)
> > > {
> > >     /* job routed to Execution queue successfully */
> > >     /* unset job's procct resource */
> > >     resource_def *pctdef;
> > >     resource *pctresc;
> > >     pctdef = find_resc_def(svr_resc_def, "procct", svr_resc_size);
> > >     if ((pctresc = find_resc_entry(&pjob->ji_wattr[JOB_ATR_resource],
> > >                                    pctdef)) != NULL)
> > >         pctdef->rs_free(&pctresc->rs_value);
> > > }
> > > }
> > > 
> > > If somebody can explain to me how to submit a job that is not caught in
> > > this if block, I may be able to fix this.
> > > 
> > 
> > This issue is now resolved. The problem occurred when no resource was
> > requested and the nodes request was then applied by default. It was
> > resolved by adding code to free the procct resource after queue and
> > server defaults are applied. The new snapshot can be found here:
> > 
> > http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.6-snap.201104191030.tar.gz
> 
> Sorry, this does not work either:
> The new code in set_resc_deflt (in svr_jobfunc.c)
> 
>   /* unset the procct resource if it has been set */
>   pctdef = find_resc_def(svr_resc_def, "procct", svr_resc_size);
> 
>   if ((pctresc = find_resc_entry(ja, pctdef)) != NULL)
>     pctdef->rs_free(&pctresc->rs_value);
> 
> unsets procct before the routing is done, which basically disables
> all of the procct routing code and makes procct completely useless.
> 
> But I have been able to reproduce the bug that occurs when setting
> 
> set server resources_default.nodes = 1
> 
> and submitting a job that requests no resources.
> I'll try to fix this.

I believe the problem is that set_resc_deflt is called twice: once from
modify_job (req_modify.c) and once from svr_enquejob (svr_jobfunc.c).
The fix is simply to unset procct only if the queue is an execution
queue, and this can be done in set_resc_deflt instead of in svr_chkque.
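
To illustrate the idea only (this is not the attached patch): the sketch
below is a simplified, self-contained mock-up. All mock_* types and
functions are hypothetical stand-ins and do not correspond to TORQUE's
real structures or signatures. It assumes a job carries a small array of
named resources and that the caller knows whether the target queue is a
routing or an execution queue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are NOT TORQUE's real types. */
enum mock_queue_type { MOCK_QUEUE_ROUTING, MOCK_QUEUE_EXECUTION };

struct mock_resource {
    const char *name;   /* resource name, e.g. "nodes" or "procct" */
    long        value;  /* numeric value of the resource */
    int         is_set; /* nonzero while the resource is set on the job */
};

struct mock_job {
    struct mock_resource res[4];
    size_t               nres;  /* number of valid entries in res[] */
};

/* Return the named resource entry on the job, or NULL if it is absent. */
struct mock_resource *mock_find_resc(struct mock_job *job, const char *name)
{
    size_t i;

    for (i = 0; i < job->nres; i++)
        if (strcmp(job->res[i].name, name) == 0)
            return &job->res[i];
    return NULL;
}

/*
 * Unset procct only when the job lands in an execution queue.
 * For a routing queue the value is left alone so that routing
 * decisions based on procct still work.
 */
void mock_unset_procct(struct mock_job *job, enum mock_queue_type qtype)
{
    struct mock_resource *pct;

    assert(job != NULL);

    if (qtype != MOCK_QUEUE_EXECUTION)
        return;  /* routing queue: procct is still needed */

    pct = mock_find_resc(job, "procct");
    if (pct != NULL && pct->is_set) {
        pct->value = 0;
        pct->is_set = 0;
    }
}
```

Because the unset step only acts on a resource that is still set, calling
it a second time for the same job is harmless, which matters if the
enclosing function is invoked from more than one code path.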

There is one bizarre issue: the at_action function for procs in
resc_def_all.c has disappeared while the function "set_proc_ct"
appears in the definition of the sds resource. I am assuming that this
is unintended and a patch somehow went awry.

New patch attached (for torque-2.5.6-snap.201104191030).

Cheers,
Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-2.5.6-snap.201104191030-procct.patch
Type: text/x-patch
Size: 1989 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20110426/9ef4a346/attachment.bin 

