[torqueusers] torque does not kill jobs when wall_time or cpu_time reached
siegert at sfu.ca
Fri Jun 4 16:29:11 MDT 2010
On Sat, Jun 05, 2010 at 07:37:46AM +1000, David Singleton wrote:
> On 06/05/2010 05:52 AM, Martin Siegert wrote:
> > On Fri, Jun 04, 2010 at 03:36:06PM +0200, Arnau Bria wrote:
> >> On Fri, 04 Jun 2010 13:30:59 +0200
> >> Mgr. Šimon Tóth wrote:
> >> Hi Simon,
> >>>> 1.-) correct a bug in src/include/pbs_config.h.in:
> >>>> RESOURCEMAXDEFAULT instead of RESOURCEMAXNOTDEFAULT
> >>>> 2.-) enable --enable-maxdefault at configure time
> >>>> and the doc should be updated.
> >>> That wouldn't make much sense. Max is the max for submission, and
> >>> that's the way it should be. The problem is that the server doesn't
> >>> reject jobs with infinite requirements when the max is set.
> >> I don't know if I've understood you, but I think we agree :-)
> >> If a max or default is set at queue level, all jobs from that queue
> >> should take those values by default. Is that what you are saying?
> >> I'd like to hear a developer's opinion on that; I'm sure there must be
> >> a good reason for changing the previous (2.3) behaviour.
> > There was a very good reason, see
> > http://www.clusterresources.com/pipermail/torqueusers/2010-January/009852.html
> > It has been a while, but we got severely bitten by the torque behaviour
> > to use resources_max.xyz as defaults. This is roughly what happened:
> > we had set
> >    set queue qs resources_max.procs = 128
> > and a user submitted a job with
> >    -l nodes=42
> > The job got submitted into the qs queue. But when moab was restarted,
> > the torque server resent the job resources. Since the job had requested
> > no procs resource and no default was set for procs either, torque sent
> > "128" as the procs resource for that job. Moab happily combined this
> > with the nodes resource, which resulted in a processor count larger
> > than the maximum of 128 for that queue, which in turn caused moab to
> > remove the job from the queue.
> > We had users' jobs disappearing from the queue because torque modified
> > their resources.
> > Clearly, using resources_max.procs as the default if no procs resource
> > is requested is nonsense (since the user can request the nodes resource
> > and the procs resource overrides it).
> > Thus, while I do not dispute that the resources_max.xyz should be
> > honoured at job submission time by qsub (and if that does not happen,
> > I agree that that is a bug), I do not agree with reverting to the
> > previous behaviour that used max values as defaults; that's in my
> > opinion a bug as well.
> > We have completely converted to configuring torque with
> > --disable-maxdefault
> > to prevent torque from changing job resources.
> I don't think the problem you had is one of max vs default. I think it is
> one of a poorly defined (either non-orthogonal or non-aligned) set of
> node/ncpus/proc/.. like resources to request. The main problem seems to
> be that moab treats procs as processors while the documentation, comments
> and code related to procs in Torque (same as the OpenPBS code) treats it
> as processes (i.e. it sets RLIMIT_NPROC in a few MOMs). You can't blame
> Torque/PBS for the misuse of a resource by moab.
As far as I recall, procs was introduced for the sole purpose of being
passed on to moab. Specifying procs only makes (or made) sense when
using moab. It was implemented because torque did not allow users to
specify "I want N processors anywhere on the cluster regardless of
their distribution across nodes".
Anyway, my understanding is that if resource xyz was not specified,
torque sent resources_max.xyz to the scheduler as if it were a
requested resource, which broke our queue/class setup.
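
As a toy illustration of that behaviour, here is a small Python sketch. It
is not torque code: the function name, the dict layout and the
max_as_default flag are hypothetical and only model my understanding
described above. It also does not model how moab combines nodes and procs.

```python
def resources_sent_to_scheduler(requested, queue_max, max_as_default):
    """Return the resource list reported to the scheduler for one job.

    Hypothetical model: with max_as_default=True (my understanding of the
    --enable-maxdefault behaviour), every queue maximum the job did not
    request is filled in as if the user had asked for it.
    """
    sent = dict(requested)  # never modify the user's own request
    if max_as_default:
        for name, limit in queue_max.items():
            sent.setdefault(name, limit)
    return sent


queue_max = {"procs": 128}   # set queue qs resources_max.procs = 128
job = {"nodes": 42}          # qsub -l nodes=42, no procs requested

with_maxdefault = resources_sent_to_scheduler(job, queue_max, True)
without_maxdefault = resources_sent_to_scheduler(job, queue_max, False)

print(with_maxdefault)     # {'nodes': 42, 'procs': 128}
print(without_maxdefault)  # {'nodes': 42}
```

In the first case the scheduler sees a procs request that the user never
made; in the second it sees only what was actually submitted.
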
> If procs is going to mean processors/cpus then I would suggest there needs
> to be a lot of code added to align nodes and procs - they are specifying
> the same thing.
Frankly: I don't care. If torque sends all requests to the scheduler
to let the scheduler handle it, that's just fine with me.