[torquedev] New 2.5.6 snapshot

Martin Siegert siegert at sfu.ca
Wed Apr 27 13:51:48 MDT 2011


Hi Ken,

in that case compilation fails with:

resc_def_all.c:464: warning: initialization from incompatible pointer type
make[2]: *** [resc_def_all.o] Error 1

(There are a lot of other warnings, mostly GPU related, that need to
be fixed before getting to this point. I can send you those separately.)
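
For context on the diagnostic itself: gcc typically prints "initialization from incompatible pointer type" when a function is placed in a static initializer slot whose function-pointer type has a different signature. The build presumably stops because the warning options turn warnings into errors. The sketch below uses made-up types and names (attr_sketch, resc_def_sketch, check_count, mismatched_hook are all hypothetical and are not the TORQUE definitions), so it only illustrates the class of problem, not the actual code at resc_def_all.c:464.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the TORQUE definitions. */
struct attr_sketch { long value; };

/* Every action hook stored in the table must have exactly this signature. */
typedef int (*action_fn)(struct attr_sketch *new_value, void *owner, int mode);

struct resc_def_sketch {
    const char *name;
    action_fn   action;   /* NULL when the resource needs no hook */
};

/* Matching signature: accepted in the initializer without a diagnostic. */
static int check_count(struct attr_sketch *new_value, void *owner, int mode)
{
    (void)owner;
    (void)mode;
    assert(new_value != NULL);
    return (new_value->value < 0) ? -1 : 0;
}

/* Mismatched signature: two parameters instead of three. */
int mismatched_hook(struct attr_sketch *new_value, int mode)
{
    (void)mode;
    return (new_value->value < 0) ? -1 : 0;
}

static struct resc_def_sketch defs[] = {
    { "procs", check_count },
    { "sds",   NULL },
    /* { "sds", mismatched_hook },
     *   gcc: "initialization from incompatible pointer type" */
};
```

Enabling the commented-out entry should reproduce the diagnostic; an action hook attached to the wrong table entry, or one with a changed parameter list, would produce the same message.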

- Martin

On Wed, Apr 27, 2011 at 12:23:49PM -0600, Ken Nielson wrote:
> Martin,
> 
> If you still have the code with the set_proc_ct in it, would you please re-run configure with the --enable-gcc-warnings option?
> 
> Regards
> 
> Ken
> 
> ----- Original Message -----
> From: "Martin Siegert" <siegert at sfu.ca>
> To: "David Beer" <dbeer at adaptivecomputing.com>, "Torque Developers mailing list" <torquedev at supercluster.org>
> Sent: Tuesday, April 26, 2011 2:25:06 PM
> Subject: Re: [torquedev] New 2.5.6 snapshot
> 
> On Wed, Apr 20, 2011 at 06:21:10PM -0700, Martin Siegert wrote:
> > On Tue, Apr 19, 2011 at 10:33:21AM -0600, David Beer wrote:
> > > 
> > > 
> > > ----- Original Message -----
> > > > Hi,
> > > > 
> > > > On Thu, Apr 07, 2011 at 05:05:04PM -0600, Ken Nielson wrote:
> > > > > > There is a new snapshot for 2.5.6 available. This fixes a problem with
> > > > > > a patch for Bugzilla 116 where the new resource procct was added. If the
> > > > > > -l nodes option was not used in a job submission then the job would not
> > > > > > be run by Moab because procct was added to the Resource_List attribute
> > > > > > and treated like a generic resource by Moab. Because the generic resource
> > > > > > procct does not exist Moab never schedules the job.
> > > > >
> > > > > This is now fixed.
> > > > >
> > > > > You can download this snapshot at
> > > > > http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.6-snap.201104071657.tar.gz
> > > > >
> > > > > Please download and let us know if you find any problems.
> > > > 
> > > > I am afraid this does not work: I haven't traced this back to the
> > > > source routine, but apparently this new version presets the nodes
> > > > resource to 1, correct?
> > > > Thus, if a user only requests -l procs=N, with 2.5.6-snap.201104071657
> > > > procct is set to N+1, not N, see
> > > > 
> > > > resc_def_all.c, line 1118:
> > > > 
> > > > >   ppct->rs_value.at_val.at_long =
> > > > >       count_proc(pnodesp->rs_value.at_val.at_str)
> > > > >       + pprocsp->rs_value.at_val.at_long;
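
A minimal sketch of the arithmetic described above, assuming a much-simplified stand-in for count_proc (the names count_proc_sketch and procct_sum are hypothetical, and the parsing handles only "N", "N:ppn=M" and host-name segments joined by "+"). It shows why a nodes value preset to 1 plus a request for procs=N yields N+1:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for count_proc(); not the TORQUE code.
 * Sums nodes * ppn over the "+"-separated segments of a nodes spec. */
static long count_proc_sketch(const char *spec)
{
    long total = 0;

    while (spec != NULL && *spec != '\0') {
        char *end;
        long nodes = strtol(spec, &end, 10);
        long ppn = 1;
        const char *seg_end = strchr(spec, '+');
        const char *p = strstr(spec, "ppn=");

        if (end == spec)          /* segment starts with a host name */
            nodes = 1;
        if (p != NULL && (seg_end == NULL || p < seg_end))
            ppn = strtol(p + 4, NULL, 10);

        total += nodes * ppn;
        spec = (seg_end != NULL) ? seg_end + 1 : NULL;
    }
    return total;
}

/* Mirrors the quoted statement: processors from nodes plus procs. */
static long procct_sum(const char *nodes_spec, long procs)
{
    assert(nodes_spec != NULL && procs >= 0);
    return count_proc_sketch(nodes_spec) + procs;
}
```

With nodes preset to "1" and a request for 8 procs, procct_sum("1", 8) evaluates to 9 rather than 8, while a plain nodes request such as procct_sum("2:ppn=4", 0) gives the expected 8.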
> > > > 
> > > > torque-2.5.6-snap.201104041023 actually worked flawlessly for me.
> > > > Which means that I haven't figured out how to trigger the bug that
> > > > torque-2.5.6-snap.201104071657 was supposed to fix.
> > > > > Regardless of whether I specified -l nodes=... or -l procs=... or
> > > > > neither, Moab always started my job, i.e., the procct resource
> > > > > always got removed before the job was sent to Moab, see
> > > > 
> > > > svr_jobfunc.c, line 1965:
> > > > 
> > > > > if (strcmp(pque->qu_attr->at_val.at_str, "Execution") == 0)
> > > > > {
> > > > >     /* job routed to Execution queue successfully */
> > > > >     /* unset job's procct resource */
> > > > >     resource_def *pctdef;
> > > > >     resource *pctresc;
> > > > >     pctdef = find_resc_def(svr_resc_def, "procct", svr_resc_size);
> > > > >     if ((pctresc = find_resc_entry(&pjob->ji_wattr[JOB_ATR_resource],
> > > > >                                    pctdef)) != NULL)
> > > > >         pctdef->rs_free(&pctresc->rs_value);
> > > > > }
> > > > > }
> > > > 
> > > > > If somebody can explain to me how to submit a job that is not caught in
> > > > > this if block, I may be able to fix this.
> > > > 
> > > 
> > > > This issue is now resolved. The problem occurred when no resource was requested and the nodes request was then applied by default. It was fixed by adding code to free the procct resource after queue and server defaults are applied. The new snapshot can be found here:
> > > 
> > > http://www.clusterresources.com/downloads/torque/snapshots/torque-2.5.6-snap.201104191030.tar.gz
> > 
> > > Sorry, this does not work either:
> > The new code in set_resc_deflt (in svr_jobfunc.c)
> > 
> >   /* unset the procct resource if it has been set */
> >   pctdef = find_resc_def(svr_resc_def, "procct", svr_resc_size);
> > 
> >   if ((pctresc = find_resc_entry(ja, pctdef)) != NULL)
> >     pctdef->rs_free(&pctresc->rs_value);
> > 
> > > unsets procct before the routing is done, which basically disables
> > > all of the procct routing code and makes procct completely useless.
> > 
> > But I have been able to reproduce the bug that occurs when setting
> > 
> > set server resources_default.nodes = 1
> > 
> > and submitting a job that requests no resources.
> > I'll try to fix this.
> 
> I believe the problem is that set_resc_deflt is called twice: once from
> modify_job (req_modify.c) and once from svr_enquejob (svr_jobfunc.c).
> Anyway, the way to fix this is to simply unset procct only if the
> queue is an execution queue. And this can be done in set_resc_deflt
> instead of in svr_chkque.
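
The approach described above (unset procct only once the job sits in an execution queue, so that routing queues can still use it) can be sketched as follows. This is an illustration with made-up types and names (queue_kind_sketch, resource_sketch, find_resource, unset_procct_if_execution are all hypothetical); it is not the attached patch and does not use the TORQUE structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the TORQUE structures. */
enum queue_kind_sketch { QUEUE_ROUTE, QUEUE_EXECUTION };

struct resource_sketch {
    const char *name;
    long        value;
    bool        is_set;
};

/* Linear search for a named resource; returns NULL if it is absent. */
static struct resource_sketch *find_resource(struct resource_sketch *list,
                                             size_t count, const char *name)
{
    assert(list != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (strcmp(list[i].name, name) == 0)
            return &list[i];
    return NULL;
}

/* Clear procct only when the job has reached an execution queue. */
static void unset_procct_if_execution(struct resource_sketch *list,
                                      size_t count,
                                      enum queue_kind_sketch kind)
{
    struct resource_sketch *procct;

    if (kind != QUEUE_EXECUTION)
        return;   /* routing queues still need procct for their decisions */

    procct = find_resource(list, count, "procct");
    if (procct != NULL) {
        procct->value = 0;
        procct->is_set = false;
    }
}
```

Because the check depends only on the queue type, calling such a function more than once for the same job (as happens when the defaults are applied from two call sites) leaves the result unchanged.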
> 
> There is one bizarre issue: the at_action function for procs in
> resc_def_all.c has disappeared while the function "set_proc_ct"
> appears in the definition of the sds resource. I am assuming that this
> is unintended and a patch somehow went awry.
> 
> New patch attached (for torque-2.5.6-snap.201104191030).
> 
> Cheers,
> Martin
> 
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev

