[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Martin Siegert siegert at sfu.ca
Fri Jun 4 13:52:49 MDT 2010


On Fri, Jun 04, 2010 at 03:36:06PM +0200, Arnau Bria wrote:
> On Fri, 04 Jun 2010 13:30:59 +0200
> Mgr. Šimon Tóth wrote:
> 
> Hi Simon,
> 
> > > 1.-) correct a bug in src/include/pbs_config.h.in:
> > >  RESOURCEMAXDEFAULT instead of RESOURCEMAXNOTDEFAULT
> > > 2.-) use --enable-maxdefault at configure time
> > > 
> > > 
> > > and doc should be updated.
> > 
> > That wouldn't make much sense. Max is max for submit and that's the
> > way it should be. The problem is that the server doesn't reject jobs
> > with infinite requirements when the max is set.
>  
> I don't know if I've understood you, but I think we agree :-)
> 
> If a max or default is set at queue level, all jobs from that queue
> should take those values by default. Is that what you are saying?
> 
> I'd like to hear some devel opinion on that, I'm sure there must be a
> good reason for changing previous (2.3) behaviour.

There was a very good reason, see
http://www.clusterresources.com/pipermail/torqueusers/2010-January/009852.html

It has been a while, but we got severely bitten by torque's behaviour
of using resources_max.xyz as defaults. This is roughly what happened:

We had set

set queue qs resources_max.procs = 128

and a user submitted a job with -l nodes=42

The job got submitted into the qs queue. But when moab was restarted,
the torque server resent the job resources. Since the job had requested
no procs resource and no default was set for procs either, torque sent
"128" as the procs resource for that job. Moab happily combined this
with the nodes resource, which resulted in a processor count larger
than the maximum of 128 for that queue, which in turn caused moab to
remove the job from the queue.
We had users' jobs disappearing from the queue because torque modified
their resources.
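
The sequence above can be sketched in a few lines of Python. This is
only an illustrative model, not torque or moab code: the function names
are hypothetical, and the rule that the scheduler adds the nodes and
procs counts is an assumption (all that is known is that the combined
count came out larger than 128).

```python
# Hypothetical model of the failure; names and the additive rule are
# assumptions for illustration, not torque or moab code.

QUEUE_MAX_PROCS = 128  # set queue qs resources_max.procs = 128


def resources_sent(job, max_as_default):
    """Return the resources the server reports for a job."""
    sent = dict(job)
    if max_as_default and "procs" not in sent:
        # Old behaviour: a missing procs request is filled in from the max.
        sent["procs"] = QUEUE_MAX_PROCS
    return sent


def scheduler_accepts(sent):
    """Assumed scheduler rule: add nodes and procs, compare with the max."""
    total = sent.get("nodes", 0) + sent.get("procs", 0)
    return total <= QUEUE_MAX_PROCS


job = {"nodes": 42}  # qsub -l nodes=42

for max_as_default in (True, False):
    sent = resources_sent(job, max_as_default)
    print(max_as_default, sent, scheduler_accepts(sent))
```

With max_as_default set, the job that asked only for 42 nodes is
reported with procs=128 as well and fails the check; without it, the
job's resources are passed on unchanged.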

Clearly, using resources_max.procs as the default if no procs resource
is requested is nonsense (since the user can request the nodes resource
and the procs resource overrides it).

Thus, while I do not dispute that the resources_max.xyz should be
honoured at job submission time by qsub (and if that does not happen,
I agree that that is a bug), I do not agree with reverting to the
previous behaviour that used max values as defaults; that's in my
opinion a bug as well.

We have completely converted to configuring torque with
--disable-maxdefault
to prevent torque from changing job resources.

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

