[torqueusers] torque does not kill jobs when wall_time or cpu_time reached
knielson at adaptivecomputing.com
Tue Jun 8 16:44:59 MDT 2010
On 06/08/2010 04:37 PM, David Singleton wrote:
> On 06/09/2010 07:29 AM, Glen Beane wrote:
>> in my opinion JOBNODEMATCHPOLICY EXACTNODE should now be the default
>> behavior since we have -l procs. If I ask for 5 nodes and 8 processors
>> per node, then that is what I should get. I don't want 10 nodes with 4
>> processors, or 2 nodes with 16 processors and 1 with 8, etc. If people
>> don't care about the layout of their job they can use -l procs.
>> Hopefully with select, things will be less ambiguous and will allow for
>> greater flexibility (let the user be as precise as they want, but also
>> allow some way to say "I don't care, just give me X processors").
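To make the ambiguity concrete, here is an illustrative sketch (not TORQUE's actual scheduler code; the function names and host table are invented for illustration) of how the same nodes=5:ppn=8 request can map to very different physical layouts depending on whether an EXACTNODE-style policy is enforced:

```python
# Illustrative only: contrast an EXACTNODE-style placement with a
# "packed" placement that treats nodes*ppn as a flat processor count.

def exactnode_layout(hosts, nodes, ppn):
    """Pick exactly `nodes` distinct hosts, taking exactly `ppn` from each."""
    chosen = [h for h, free in hosts.items() if free >= ppn][:nodes]
    if len(chosen) < nodes:
        return None  # request cannot be satisfied as stated
    return {h: ppn for h in chosen}

def packed_layout(hosts, nodes, ppn):
    """Treat nodes*ppn as a total processor count and pack greedily."""
    needed = nodes * ppn
    layout = {}
    for h, free in hosts.items():
        take = min(free, needed)
        if take:
            layout[h] = take
            needed -= take
        if needed == 0:
            return layout
    return None  # not enough free processors in total

hosts = {"n01": 16, "n02": 16, "n03": 8, "n04": 8, "n05": 8, "n06": 8}
print(exactnode_layout(hosts, 5, 8))  # 8 processors on each of 5 hosts
print(packed_layout(hosts, 5, 8))     # same 40 processors on only 3 hosts
```

Both layouts satisfy "40 processors", but only the first matches the literal reading of nodes=5:ppn=8.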
> Our experience is that very few users want detailed control over exactly
> how many physical nodes they get - it seems to be only comp sci students
> or similar with mistaken ideas about the value of such control. They
> don't seem to realise that when they demand 1 cpu from each of 16 nodes,
> variability in what is running on the other cpus on those nodes will make
> a mockery of any performance numbers they deduce. Other reasons for
> requesting exact nodes are usually to do with another resource (memory,
> network interfaces, GPUs, ...). It should be requests for those
> resources/node properties that get the user what they want, not the
> number of nodes.
> We certainly have more users with hybrid MPI-OpenMP codes, and for them
> nodes are really "virtual nodes", e.g. a request for -lnodes=8:ppn=4
> means the job will be running with 8 MPI tasks, each of which will have
> 4 threads - the job needs any (the best?) set of cpus that can run that.
> A 32P SMP might be a perfectly acceptable solution.
> I suspect hybrid codes will become more common.
> So I would suggest EXACTNODE should not be the default, but rather that
> users thinking they want such detailed control should have to specify
> some other option to show this (e.g. -lother=exactnodes), i.e. nodes are
> "virtual nodes" unless the user specifies otherwise.
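The hybrid case above might look like the following job-script fragment (a sketch, not from the original post; the application name and walltime are invented, and the mpirun invocation assumes an MPI stack that honors the PBS node list):

```sh
#PBS -l nodes=8:ppn=4
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
# 4 OpenMP threads per MPI task, matching ppn=4
export OMP_NUM_THREADS=4
# 8 MPI tasks, one per "virtual node" - the scheduler is free to place
# them on any set of cpus that fits, including a single large SMP
mpirun -np 8 ./hybrid_app
```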
tpn is an option on some systems, used something like -l nodes=2:tpn=4,
which says: I want two separate hosts, with exactly four processors from
each host.
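Side by side, the two forms would be requested like this (a sketch; as noted above, tpn is site-specific and not available in stock TORQUE everywhere):

```sh
qsub -l nodes=2:ppn=4 job.sh   # 2 "virtual nodes" x 4 - may be combined
qsub -l nodes=2:tpn=4 job.sh   # exactly 2 distinct hosts, 4 procs each
```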
>> Also, the documentation should be clear that when you request a number
>> of processors per node (ppn) or a number of processors (procs), it is
>> talking about virtual processors as configured in pbs_server.
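For reference, virtual processors are set per host with the np attribute in the server's nodes file (the host names and property below are made up for illustration):

```
# $TORQUE_HOME/server_priv/nodes
# np = virtual processors per host; need not equal physical cores
n01 np=8
n02 np=8
n03 np=16 bigmem
```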