[torqueusers] torque does not kill jobs when wall_time or cpu_time reached
glen.beane at gmail.com
Tue Jun 8 17:54:39 MDT 2010
On Jun 8, 2010, at 6:37 PM, David Singleton
<David.Singleton at anu.edu.au> wrote:
> On 06/09/2010 07:29 AM, Glen Beane wrote:
>> in my opinion JOBNODEMATCHPOLICY EXACTNODE should now be the default
>> behavior since we have -l procs. If I ask for 5 nodes and 8 processors
>> per node then that is what I should get. I don't want 10 nodes with 4
>> processors, or 2 nodes with 16 processors and 1 with 8, etc. If users
>> don't care about the layout of their job they can use -l procs.
>> hopefully with select things will be less ambiguous and will allow
>> greater flexibility (let the user be as precise as they want, but also
>> allow some way to say I don't care, just give me X processors).
> Our experience is that very few users want detailed control over
> how many physical nodes they get - it seems to be only comp sci
> or similar with mistaken ideas about the value of such control. They
> don't seem to realise that when they demand 1 cpu from each of 16 nodes,
> variability in what is running on the other cpus on those nodes will
> make a mockery of any performance numbers they deduce. Other reasons for
> requesting exact nodes are usually to do with another resource (e.g.
> network interfaces, GPUs, ...). It should be requests for those
> node properties that get what the user wants, not the number of nodes.
> We certainly have more users with hybrid MPI-OpenMP codes and for them
> nodes are really "virtual nodes", eg. a request for -lnodes=8:ppn=4 means
> the job will be running with 8 MPI tasks each of which will have 4
> threads -
> the job needs any (the best?) set of cpus that can run that. A 32P machine
> might be a perfectly acceptable solution.
Select takes care of this. You request 8 tasks with 4 virtual procs per
task. The scheduler can co-locate tasks. However, if I go through the
trouble of requesting a specific number of nodes then I should get them.
Replying from my phone so ignore the rest of this email. It is a pain
to delete what I'm not commenting on.
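The request styles being contrasted above can be sketched as qsub one-liners (a sketch only: the node and processor counts are made up, and the exact select syntax depends on the scheduler in use, e.g. Moab or PBS Pro):

```
# Exact layout: 5 nodes with 8 (virtual) processors per node; with
# JOBNODEMATCHPOLICY EXACTNODE set, this should map to exactly 5 nodes.
qsub -l nodes=5:ppn=8 job.sh

# Layout-agnostic: 40 processors, placed wherever the scheduler likes.
qsub -l procs=40 job.sh

# select-style request: 8 tasks of 4 cpus each; the scheduler is free
# to co-locate tasks, e.g. run all 32 cpus on one large machine.
qsub -l select=8:ncpus=4 job.sh
```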
> I suspect hybrid codes will become more common.
> So I would suggest EXACTNODE should not be the default but rather that
> users thinking they want such detailed control should have to
> specify some
> other option to show this (eg. -lother=exactnodes), ie. nodes are
> "virtual nodes" unless the user specifies otherwise.
>> Also, the documentation should be clear that when you request a number
>> of processors per node (ppn) or a number of processors (procs) it is
>> talking about virtual processors as configured in pbs_server
> Note that virtual processors != physical processors causes a number of
> problems. Certainly cpuset-aware MOMs are going to barf with such a
> config, and the problem is that they don't know this is the config; only
> the server and scheduler do. It sorta makes sense for the number of
> virtual processors to be set in the MOM's config file so it can shut down
> NUMA/cpuset code when it doesn't make sense.
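The "virtual processors as configured in pbs_server" mentioned above are the np= counts in the server's nodes file ($TORQUE_HOME/server_priv/nodes); the hostnames and counts below are made up for illustration:

```
# $TORQUE_HOME/server_priv/nodes
# np sets virtual processors per node and need not match the physical
# core count -- e.g. node02 is oversubscribed 2x here, which is exactly
# the config the MOM cannot see and a cpuset-aware MOM may choke on.
node01 np=8
node02 np=16
```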
> torqueusers mailing list
> torqueusers at supercluster.org