[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

David Singleton David.Singleton at anu.edu.au
Fri Jun 4 15:37:46 MDT 2010


On 06/05/2010 05:52 AM, Martin Siegert wrote:
> On Fri, Jun 04, 2010 at 03:36:06PM +0200, Arnau Bria wrote:
>> On Fri, 04 Jun 2010 13:30:59 +0200
>> Mgr. Šimon Tóth wrote:
>>
>> Hi Simon,
>>
>>>> 1) correct a bug in src/include/pbs_config.h.in:
>>>>    RESOURCEMAXDEFAULT instead of RESOURCEMAXNOTDEFAULT
>>>> 2) enable --enable-maxdefault at configure time


>>>> and the docs should be updated.
>>>
>>> That wouldn't make much sense. Max is the maximum at submit time, and
>>> that's the way it should be. The problem is that the server doesn't
>>> reject jobs with infinite requirements when the max is set.
>>
>> I don't know if I've understood you, but I think we agree :-)
>>
>> If a max or a default is set at queue level, all jobs from that queue
>> should take those values by default. Is that what you are saying?
>>
>> I'd like to hear a developer's opinion on this; I'm sure there must be
>> a good reason for changing the previous (2.3) behaviour.
>
> There was a very good reason, see
> http://www.clusterresources.com/pipermail/torqueusers/2010-January/009852.html
>
> It has been a while, but we got severely bitten by torque's behaviour
> of using resources_max.xyz as defaults. Roughly, this is what happened:
>
> we had set
>
> set queue qs resources_max.procs = 128
>
> and a user submitted a job with -l nodes=42
>
> The job got submitted into the qs queue. But when moab was restarted,
> the torque server resent the job's resources. Since the job had not
> requested a procs resource, and no default was set for procs either,
> torque sent "128" as the procs resource for that job. Moab happily
> combined that with the nodes resource, which resulted in a processor
> count larger than the maximum of 128 for that queue, which in turn
> caused moab to remove the job from the queue. We had users' jobs
> disappearing from the queue because torque modified their resources.
>
> Clearly, using resources_max.procs as the default if no procs resource
> is requested is nonsense (since the user can request the nodes resource
> and the procs resource overrides it).
>
> Thus, while I do not dispute that the resources_max.xyz should be
> honoured at job submission time by qsub (and if that does not happen,
> I agree that that is a bug), I do not agree with reverting to the
> previous behaviour that used max values as defaults; that's in my
> opinion a bug as well.
>
> We have completely converted to configuring torque with
> --disable-maxdefault
> to prevent torque from changing job resources.
>
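For clarity, the sequence Martin describes can be sketched as shell commands. This is only a sketch: the queue name qs and the nodes=42 request come from his post, the script name job.sh is a placeholder, and the exact reporting behaviour depends on the torque version and whether it was built with max-as-default enabled.

```shell
# Queue limit as described in the post:
qmgr -c "set queue qs resources_max.procs = 128"

# The user's job requested nodes only, never procs
# (job.sh is a hypothetical script name):
qsub -l nodes=42 job.sh

# With resources_max used as a default (the behaviour under debate),
# torque reports the unrequested procs resource back as procs=128,
# which moab then combines with the nodes=42 request.

# The build-time workaround Martin's site adopted:
./configure --disable-maxdefault
```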

I don't think the problem you had is one of max vs. default.  I think it
is one of a poorly defined (either non-orthogonal or non-aligned) set of
nodes/ncpus/procs/... resources to request.  The main problem seems to
be that moab treats procs as processors, while the documentation, comments
and code related to procs in Torque (the same as the OpenPBS code) treat
it as processes (i.e. it sets RLIMIT_NPROC in a few MOMs).  You can't
blame Torque/PBS for moab's misuse of a resource.

If procs is going to mean processors/cpus, then I would suggest a lot of
code needs to be added to align nodes and procs - they are specifying
the same thing.

David
